Abstract

A natural probabilistic model for motif discovery has been used to experimentally test the
quality of motif discovery programs. In this model, there are $k$ background sequences, and
each character in a background sequence is a random character from an alphabet $\Sigma$. A motif
$G = g_1 g_2 \cdots g_m$ is a string of $m$ characters. Each background sequence is implanted with a
probabilistically generated approximate copy of $G$. For such an approximate copy
$b_1 b_2 \cdots b_m$ of $G$, every character $b_i$ is generated such that the probability
of $b_i \ne g_i$ is at most $\alpha$. We develop three algorithms that, under this probabilistic model,
find the implanted motif with high probability via a tradeoff between computational time and
the probability of mutation. Each algorithm has a preprocessing part and a voting part.
We use a pair of functions $(t_1(n, k), t_2(n, k))$ to describe the computational complexity of a motif
detection algorithm, where $n$ is the largest length of an input sequence and $k$ is the number of
sequences. Function $t_1(n, k)$ is the time complexity of the preprocessing part, and $t_2(n, k)$ is
the time complexity of recovering one character of the motif after preprocessing. The total time
is $O(t_1(n, k) + t_2(n, k)|G|)$.

(1) There exists a randomized algorithm such that there are positive constants $c_0$ and $c_1$ that
if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length is
at least $c_0 \log n$, and each character in the motif region has probability at most $\frac{1}{(\log n)^{2+\mu}}$ of mutation
for some fixed $\mu > 0$, then the motif can be recovered in $(O(\frac{n}{\sqrt{h}}(\log n)^{5/2} + h^2 \log n), O(\log n))$ time,
where $n$ is the longest length of any input sequence and $h = \min(|G|, n^{2/5})$. The total time of the
algorithm is sublinear if the motif length $|G|$ is in the range $[(\log n)^{7+\mu}, \frac{n}{(\log n)^{1+\mu}}]$. This is the first
sublinear time algorithm with rigorous analysis in this model.

(2) There exists a randomized algorithm such that there are positive constants $c_0$, $c_1$, and $\alpha$
that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length
is at least $c_0 \log n$, and each character in the motif region has probability at most $\alpha$ of mutation,
then the motif can be recovered in $(O(\frac{n^2}{\sqrt{h}}(\log n)^{O(1)}), O(\log n))$ time.

(3) There exists a deterministic algorithm such that there are positive constants $c_0$, $c_1$, and $\alpha$
that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length
is at least $c_0 \log n$, and each character in the motif region has probability at most $\alpha$ of mutation,
then the motif can be recovered in $(O(n^2 (\log n)^{O(1)}), O(\log n))$ time.

The methods developed in this paper have been used in a software implementation. We
observed some encouraging results that show improved performance for motif detection compared
with other software.
1. Introduction

Motif discovery is an important problem in computational biology and computer science. For 
instance, it has applications in coding theory [3,6], locating binding sites and conserved regions in 
unaligned sequences [8,12,20,21], genetic drug target identification [11], designing genetic probes 
[11], and universal PCR primer design [2, 11, 16, 19]. 

This paper focuses on the application of motif discovery to find conserved regions in a set of 
given DNA, RNA, or protein sequences. Such conserved regions may represent common biological 
functions or structures. Many performance measures have been proposed for motif discovery. Let $C$
be a subset of 0-1 sequences of length $n$. The covering radius of $C$ is the smallest integer $r$ such that
each vector in $\{0, 1\}^n$ is at a distance at most $r$ from a string in $C$. The decision problem associated
with the covering radius for a set of binary sequences is NP-complete [3]. The similar closest string
and substring problems were proved to be NP-hard [3, 11]. Some approximation algorithms have
been proposed. Li et al. [14] gave an approximation scheme for the closest string and substring
problems. The related consensus patterns problem is: given $n$ sequences $s_1, \ldots, s_n$, find a region
of length $L$ in each $s_i$, and a string $s$ of length $L$, so that the total Hamming distance from $s$ to these
regions is minimized. Approximation algorithms for the consensus patterns problem were reported
in [13]. Furthermore, a number of heuristics and programs have been developed [1, 9, 10, 18, 22]. 

In many applications, motifs are faint and may not be apparent when two sequences alone 
are compared but may become clearer when more sequences are compared together [7]. For this 
reason, it has been conjectured that comparing more sequences together can help with identifying 
faint motifs. This paper is a theoretical approach with a rigorous probabilistic analysis. 

We study a natural probabilistic model for motif discovery. In this model, there are $k$
background sequences and each character in a background sequence is a random character from an
alphabet $\Sigma$. A motif $G = g_1 g_2 \cdots g_m$ is a string of $m$ characters. Each background sequence is
implanted with a probabilistically generated approximate copy of $G$. For a probabilistically generated
approximate copy $b_1 b_2 \cdots b_m$ of $G$, every character $b_i$ is probabilistically generated such that the
probability of $b_i \ne g_i$, which is called a mutation, is at most $\alpha$. This model was first proposed in
[18] and has been widely used in experimentally testing motif discovery programs [1, 9, 10, 22]. We
note that a mutation in our model converts a character $g_i$ in the motif into a different character $b_i$
without probability restriction. This means that a character $g_i$ in the motif need not become each
character $b_i$ in $\Sigma - \{g_i\}$ with equal probability.

We develop three algorithms that, under the probabilistic model, find the implanted
motif with high probability via a tradeoff between computational time and the probability of
mutation. Each algorithm has a preprocessing phase and a voting phase. We use a pair of
functions $(t_1(n, k), t_2(n, k))$ to describe the computational complexity of a motif detection algorithm,
where $n$ is the largest length of an input sequence and $k$ is the number of sequences. Function
$t_1(n, k)$ is the time complexity of the preprocessing phase, and $t_2(n, k)$ is the time complexity of
recovering one character of the motif after preprocessing. The total time is $O(t_1(n, k) + t_2(n, k)|G|)$.

(1) There exists a randomized algorithm such that there are positive constants $c_0$ and $c_1$ that
if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length is at
least $c_0 \log n$, and each character in the motif region has probability at most $\frac{1}{(\log n)^{2+\mu}}$ of mutation for
some fixed $\mu > 0$, then the motif can be recovered in $(O(\frac{n}{\sqrt{h}}(\log n)^{5/2} + h^2 \log n), O(\log n))$ time, where
$n$ is the longest length of any input sequence and $h = \min(|G|, n^{2/5})$. The total time of the algorithm is
sublinear if the motif length $|G|$ is in the range $[(\log n)^{7+\mu}, \frac{n}{(\log n)^{1+\mu}}]$. This is the first sublinear
time algorithm with rigorous analysis in this model.

(2) There exists a randomized algorithm such that there are positive constants $c_0$, $c_1$, and $\alpha$
that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length
is at least $c_0 \log n$, and each character in the motif region has probability at most $\alpha$ of mutation, then
the motif can be recovered in $(O(\frac{n^2}{\sqrt{h}}(\log n)^{O(1)}), O(\log n))$ time.

(3) There exists a deterministic algorithm such that there are positive constants $c_0$, $c_1$, and $\alpha$
that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length
is at least $c_0 \log n$, and each character in the motif region has probability at most $\alpha$ of mutation, then
the motif can be recovered in $(O(n^2 (\log n)^{O(1)}), O(\log n))$ time.

The research in this model has been reported in [4, 5, 15]. In [4], Fu et al. developed an algorithm
that needs the alphabet size to be a constant that is much larger than 4. Our algorithm in [5]
cannot handle all possible motif patterns. In [15], Liu et al. designed an algorithm that runs in $O(n^3)$
time and lacks a rigorous analysis of its performance. Motif recovery in this natural and
simple model is not yet fully understood and appears to be a complicated problem.

This paper presents two new randomized algorithms and one new deterministic algorithm. They 
make advancements in the following aspects: 1. The algorithms are much faster than those before. 
Our algorithms can even run in sublinear time. 2. They can handle any motif pattern. 3. The 
restriction for the alphabet size is as small as four, giving them potential applications in practical 
problems since gene sequences have an alphabet size 4. 4. All algorithms have rigorous proofs 
about their performances. 

The entire algorithm Recover-Motif is described in Section 4.2. We analyze Algorithm Recover-Motif in
Section 5.

2. Notations and the Model of Sequence Generation 

For a set $A$, $\|A\|$ denotes the number of elements in $A$. $\Sigma$ is an alphabet with $\|\Sigma\| = t \ge 2$.
For an integer $n \ge 0$, $\Sigma^n$ is the set of sequences of length $n$ with characters from $\Sigma$. For a
sequence $S = a_1 a_2 \cdots a_n$, $S[i]$ denotes the character $a_i$, and $S[i, j]$ denotes the substring $a_i \cdots a_j$
for $1 \le i \le j \le n$. $|S|$ denotes the length of the sequence $S$. We use $\emptyset$ to represent the empty
sequence, which has length 0.

Let $G = g_1 g_2 \cdots g_m$ be a fixed sequence of $m$ characters. $G$ is the motif to be discovered by our
algorithm. A $\Theta_\alpha(n, G)$-sequence (also written $\Theta(n, G, \alpha)$-sequence) has the form
$S = a_1 \cdots a_{n_1} b_1 \cdots b_m a_{n_1+1} \cdots a_{n_2}$, where $n_2 + m \le n$,
each $a_i$ has probability $\frac{1}{t}$ to be equal to $\pi$ for each $\pi \in \Sigma$, and $b_i$ has probability at most $\alpha$ not
equal to $g_i$ for $1 \le i \le m$, where $m = |G|$. $\aleph(S)$ denotes the motif region $b_1 \cdots b_m$ of $S$. A mutation
converts a character $g_i$ in the motif into an arbitrary different character $b_i$ without probability
restriction. This allows a character $g_i$ in the motif to change into the characters $b_i$ in $\Sigma - \{g_i\}$ with
different probabilities. The motif region of $S$ may start at an arbitrary or worst-case position
in $S$. Also, a mutation may convert a character $g_i$ in the motif into an arbitrary or worst-case
different character $b_i$, subject only to the restriction that $g_i$ mutates with probability at most $\alpha$.

A ^(n, G)-sequence has the form $S = a_1 \cdots a_{n_1} b_1 \cdots b_m a_{n_1+1} \cdots a_{n_2}$, where $n_2 + m \le n$, each
$a_i$ has probability $\frac{1}{t}$ to be equal to $\pi$ for each $\pi \in \Sigma$, and there are at most $O(1)$ characters $b_i$ not
equal to $g_i$ for $1 \le i \le m$ and each mutation occurs at a random position of $G$, where $m = |G|$.

For two sequences $S_1 = a_1 \cdots a_m$ and $S_2 = b_1 \cdots b_m$ of the same length, let the relative Hamming
distance be $\mathrm{diff}(S_1, S_2) = \frac{|\{i : a_i \ne b_i,\ i = 1, \ldots, m\}|}{m}$.
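
The relative Hamming distance is the fraction of positions at which two equal-length sequences disagree. The following is a minimal Python sketch of it. The function name `diff` mirrors the notation used later in the paper; returning 0 for two empty sequences is a convention assumed here, not something the paper specifies.

```python
def diff(s1: str, s2: str) -> float:
    """Relative Hamming distance between two sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    if not s1:
        return 0.0  # assumed convention for the empty sequence
    # Count the positions where the two sequences disagree.
    mismatches = sum(1 for a, b in zip(s1, s2) if a != b)
    return mismatches / len(s1)


# One mutated character out of 16 gives a relative distance of 1/16.
print(diff("TTTTTAACGATTAGCS", "TTTTTAACGGTTAGCS"))
```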

Definition 1. For two intervals $[i_1, j_1]$ and $[i_2, j_2]$, define $\mathrm{shift}([i_1, j_1], [i_2, j_2]) = \min(|i_1 - i_2|, |j_1 - j_2|)$.
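
Definition 1 compares two intervals by the smaller of the offset between their left endpoints and the offset between their right endpoints. A small sketch, assuming intervals are represented as `(left, right)` tuples of integers (a representation chosen here for illustration):

```python
def shift(interval1, interval2):
    """Smaller of the left-endpoint offset and the right-endpoint offset."""
    (i1, j1), (i2, j2) = interval1, interval2
    return min(abs(i1 - i2), abs(j1 - j2))


# Left endpoints differ by 2, right endpoints by 1, so the shift is 1.
print(shift((3, 10), (5, 11)))
```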



3. Brief Introduction to Algorithm 

Every detection algorithm in this paper has two phases. The first phase is preprocessing so that
the motif regions from multiple sequences can be aligned in the same column region. The second
phase is to recover the motif via voting. We use a pair of functions $(t_1(n, k), t_2(n, k))$ to describe the
computational complexity of a motif detection algorithm. Function $t_1(n, k)$ is the time complexity
of the preprocessing phase and $t_2(n, k)$ is the time complexity of outputting one character of the
motif in the voting phase.

The motif $G$ is a pattern unknown to algorithm Recover-Motif, and algorithm Recover-Motif will
attempt to recover $G$ from a series of $\Theta(n, G, \alpha)$-sequences generated by the probabilistic model.

3.1. Algorithm 

The algorithm first detects a position that is close to the left motif boundary in a sequence. It finds 
such a position via sampling and collision between two sequences. After the rough left boundary of a
sequence is found, it is used to find the rough boundaries of the rest of the sequences. Similarly, we
find those right boundaries of motif among the input sequences. The exact left boundary of each 
motif region will be detected in the next phase via voting. Each character of the motif is recovered 
by voting among all the characters at the same positions in the motif regions of input sequences. 

Descriptions of Algorithm 

Input: $Z = Z_1 \cup Z_2$, where $Z_1 = \{S'_1, \ldots, S'_{2k_1}\}$ and $Z_2 = \{S''_1, \ldots, S''_{k_2}\}$ are two sets of input
sequences.

Output: Planted motif in each sequence and consensus string.

Start:

Randomly select sample points from each sequence in both $Z_1$ and $Z_2$.

For each pair of sequences selected from $Z_1$ and $Z_2$,
    Find the rough left and rough right boundaries.
    Improve rough boundaries.

If motif boundaries of each sequence in $Z_2$ are not empty,
    Use Voting algorithm to get the planted motifs.

End of Algorithm 

3.2. An Example 

We illustrate the main idea of our algorithm with the following example. The input strings
are defined below. We assume that the original motif is TTTTTAACGATTAGCS. The motif
part is displayed with bold font, and the mutation characters in the motif region are displayed with
small font.

3.2.1. Input Sequences 

The input contains two groups $Z_1 = \{S'_1, S'_2\}$ and $Z_2 = \{S''_1, S''_2, S''_3, S''_4, S''_5\}$.

$Z_1$:
$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.
$S'_2$ = AATCCTTACTTTTAACGATTAGCSGTC.

The above two strings are used to detect the initial motif region, which is then used to locate the
motif in the second group below.

$Z_2$:

$S''_1$ = ATTCGATCCAGTTTTTAACGGTTAGCSCAATTACTTAG.

$S''_2$ = GCATTGCATTTTTTAACGATTACCSGTACTTAGCTAGATC.

$S''_3$ = TCAGGGCATCGAGACTTTTTAGCGATTAGCSCTAGAATCAGACCT.

$S''_4$ = GTACCTGGCATTGAACGTTTTTAACGATTAGCATGCAGATGGACCTTTA.

$S''_5$ = AATGGATCAGATTTTTAACGATTCGCSCTAGATTCAG.

3.2.2. Select Sample Points 

Some sample points of the two sequences in $Z_1$ are selected and marked.

$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.

$S'_2$ = AATCCTTACTTTTAACGATTAGCSGTC.

3.2.3. Collision Detection 

In this step, the left and right rough boundaries of the two sequences are marked. The following
shows the left collision, which happens near the left motif boundary and is marked with square
brackets around the TATT and TTTT subsequences.

$S'_1$ = GTACCATGGAT[TATT]AACGATTAGCSTAGAGGACCTA.

$S'_2$ = AATCCTTAC[TTTT]AACGATTAGCSGTC.

The following shows the right collision, which happens near the right motif boundary and is
marked with square brackets around the two TTAG subsequences.

$S'_1$ = GTACCATGGATTATTAACGA[TTAG]CSTAGAGGACCTA.

$S'_2$ = AATCCTTACTTTTAACGA[TTAG]CSGTC.

3.2.4. Improving the Boundaries 

In the early phase of the algorithm, we first detect a small piece of the motif in $S'_1$ by comparing $S'_1$ and
$S'_2$. Assume "TATT" and "TTAG" are found in the left and right motif regions of $S'_1$, respectively.
The rough motif length is calculated as the difference between the location of the first character 'T' of the
first subsequence and the location of the last character 'G' of the second subsequence. The position
marked by "A" is the rough left boundary of the motif and the position marked by "T" is the rough
right boundary of the motif in $S'_1$ below.

$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.
$S'_2$ = AATCCTTACTTTTAACGATTAGCSGTC.

3.2.5. Select Sample Points for the Sequences in $Z_2$

Some sample points near the motif boundaries of $S'_1$ are selected.

$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.

Sample points are selected in each sequence in $Z_2$.

$S''_1$ = ATTCGATCCAGTTTTTAACGGTTAGCSCAATTACTTAG.

$S''_2$ = GCATTGCATTTTTTAACGATTACCSGTACTTAGCTAGATC.

$S''_3$ = TCAGGGCATCGAGACTTTTTAGCGATTAGCSCTAGAATCAGACCT.

$S''_4$ = GTACCTGGCATTGAACGTTTTTAACGATTAGCATGCAGATGGACCTTTA.

$S''_5$ = AATGGATCAGATTTTTAACGATTCGCSCTAGATTCAG.

3.2.6. Collision Detection Between $S'_1$ and the Sequences in $Z_2$

Some sample points near the motif boundaries of $S'_1$ are selected.

$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.

Sample points are selected in each sequence in $Z_2$.

$S''_1$ = ATTCGATCCAGTTTTTAACGGTTAGCSCAATTACTTAG.

$S''_2$ = GCATTGCATTTTTTAACGATTACCSGTACTTAGCTAGATC.

$S''_3$ = TCAGGGCATCGAGACTTTTTAGCGATTAGCSCTAGAATCAGACCT.

$S''_4$ = GTACCTGGCATTGAACGTTTTTAACGATTAGCATGCAGATGGACCTTTA.

$S''_5$ = AATGGATCAGATTTTTAACGATTCGCSCTAGATTCAG.

3.2.7. Improving the Motif Boundaries for the Sequences in $Z_2$

After the collision with the sequences in $Z_2$, we obtain the rough locations of the motifs of the sequences
in $Z_2$. The motif boundaries for the sequences in $Z_2$ are then improved.

$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.

The improved motif boundaries of the sequences in $Z_2$ are marked below.

$S''_1$ = ATTCGATCCAGTTTTTAACGGTTAGCSCAATTACTTAG.

$S''_2$ = GCATTGCATTTTTTAACGATTACCSGTACTTAGCTAGATC.

$S''_3$ = TCAGGGCATCGAGACTTTTTAGCGATTAGCSCTAGAATCAGACCT.

$S''_4$ = GTACCTGGCATTGAACGTTTTTAACGATTAGCATGCAGATGGACCTTTA.

$S''_5$ = AATGGATCAGATTTTTAACGATTCGCSCTAGATTCAG.

3.2.8. Motif Boundaries for the Sequences in $Z_2$

$S'_1$ = GTACCATGGATTATTAACGATTAGCSTAGAGGACCTA.

Use the pair $(G_l, G_r)$ with $G_l$ = TTAT and $G_r$ = AGCS to find the motif boundaries in the
sequences of $Z_2$. The rough boundaries of the second group are marked below with underlines.

$S''_1$ = ATTCGATCCAGTTTTTAACGGTTAGCSCAATTACTTAG.

$S''_2$ = GCATTGCATTTTTTAACGATTACCSGTACTTAGCTAGATC.

$S''_3$ = TCAGGGCATCGAGACTTTTTAGCGATTAGCSCTAGAATCAGACCT.

$S''_4$ = GTACCTGGCATTGAACGTTTTTAACGATTAGCATGCAGATGGACCTTTA.

$S''_5$ = AATGGATCAGATTTTTAACGATTCGCSCTAGATTCAG.



3.2.9. Extracting the Motif Regions 

The motif regions of the second group are extracted. The original motif is recovered via voting
at each column.

$G''_1$ = TTTTTAACGGTTAGCS

$G''_2$ = TTTTTAACGATTACCS

$G''_3$ = TTTTTAGCGATTAGCS

$G''_4$ = TTTTTAACGATTAGCA

$G''_5$ = TTTTTAACGATTCGCS

3.2.10. Recovering Motif via Voting 

The original motif TTTTTAACGATTAGCS is recovered via voting at all columns. For example,
the last S in the motif is recovered via voting among the characters S, S, S, A, S in the last column.
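
The voting step on this example can be sketched in a few lines of Python. This is only an illustration of column-wise plurality voting over already aligned motif regions; the function name `vote` and the tie-breaking rule (the character seen first in a column wins a tie) are assumptions made here, not details taken from the paper.

```python
from collections import Counter


def vote(regions):
    """Column-wise plurality vote over aligned, equal-length motif regions."""
    if not regions:
        return ""
    length = len(regions[0])
    if any(len(r) != length for r in regions):
        raise ValueError("all motif regions must have the same length")
    consensus = []
    for column in zip(*regions):
        # most_common(1) yields the most frequent character of the column.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


# The five extracted motif regions of the example.
regions = [
    "TTTTTAACGGTTAGCS",
    "TTTTTAACGATTACCS",
    "TTTTTAGCGATTAGCS",
    "TTTTTAACGATTAGCA",
    "TTTTTAACGATTCGCS",
]
print(vote(regions))  # prints TTTTTAACGATTAGCS
```

Each region carries exactly one mutation and the mutations fall in different columns, so every column has a four-to-one majority for the original character.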

3.3. Our Results 

We give an algorithm for the case with mutation rate at most $\frac{1}{(\log n)^{2+\mu}}$. The performance of the
algorithm is stated in Theorem 2. Theorem 2 implies Corollary 3 by selecting $k = c_1 \log n$ with
some constant $c_1$ large enough.

Theorem 2. Assume that $\mu$ is a fixed number in $(0, 1)$ and the alphabet size $t$ is at least 4. There
exists a randomized algorithm such that there is a constant $c_0$ that if the length of the motif $G$ is at
least $c_0 \log n$, then given $k$ independent $\Theta(n, G, \frac{1}{(\log n)^{2+\mu}})$-sequences, the algorithm outputs $G'$ such
that

1) with probability at most $e^{-\Omega(k)}$, $|G'| \ne |G|$, and

2) for each $1 \le i \le |G|$, with probability at most $e^{-\Omega(k)}$, $G'[i] \ne G[i]$, and

3) with probability at most $\frac{1}{2^x}$, the algorithm Recover-Motif does not stop in
$(O(k(\frac{n}{\sqrt{h}}(\log n)^{5/2} + h^2 \log n)), O(k))$ time,

where $n$ is the longest length of any input sequence and $h = \min(|G|, n^{2/5})$.



Corollary 3. There exists a randomized algorithm such that there are positive constants $c_0$, $c_1$, and
$\mu$ that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length
is at least $c_0 \log n$, and each character in the motif region has probability at most $\frac{1}{(\log n)^{2+\mu}}$ of mutation,
then the motif can be recovered in $(O(\frac{n}{\sqrt{h}}(\log n)^{5/2} + h^2 \log n), O(\log n))$ time, where $n$ is the longest
length of any input sequence and $h = \min(|G|, n^{2/5})$.

We give a randomized algorithm for the case with $\Omega(1)$ mutation rate. The performance of the
algorithm is stated in Theorem 4. Theorem 4 implies Corollary 5 by selecting $k = c_1 \log n$ with
some constant $c_1$ large enough.

Theorem 4. Assume that the alphabet size $t$ is at least 4. There exists a randomized algorithm
such that there is a constant $c_0$ that if the length of the motif $G$ is at least $c_0 \log n$, then given $k$
independent $\Theta(n, G, \Omega(1))$-sequences, the algorithm outputs $G'$ such that

1) with probability at most $e^{-\Omega(k)}$, $|G'| \ne |G|$, and

2) for each $1 \le i \le |G|$, with probability at most $e^{-\Omega(k)}$, $G'[i] \ne G[i]$, and

3) with probability at most $\frac{1}{2^x}$, the algorithm Recover-Motif does not stop in
$(O(k(\frac{n^2}{\sqrt{h}}(\log n)^{O(1)} + h^2)), O(k))$ time,

where $n$ is the longest length of any input sequence and $h = \min(|G|, n^{2/5})$.



Corollary 5. There exists a randomized algorithm such that there are positive constants $c_0$, $c_1$, and
$\alpha$ that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif length
is at least $c_0 \log n$, and each character in the motif region has probability at most $\alpha$ of mutation, then
the motif can be recovered in $(O(\frac{n^2}{\sqrt{h}}(\log n)^{O(1)}), O(\log n))$ time.

We give a deterministic algorithm for the case with $\Omega(1)$ mutation rate. The performance of
the algorithm is stated in Theorem 6. Theorem 6 implies Corollary 7 by selecting $k = c_1 \log n$ with
some constant $c_1$ large enough.

Theorem 6. Assume that the alphabet size $t$ is at least 4. There exists a deterministic algorithm
such that there is a constant $c_0$ that if the length of the motif $G$ is at least $c_0 \log n$, then given $k$
independent $\Theta(n, G, \Omega(1))$-sequences, the algorithm runs in $(O(n^2 (\log n)^{O(1)} + h^2 k), O(k))$ time and outputs
$G'$ such that

1) with probability at most $e^{-\Omega(k)}$, $|G'| \ne |G|$, and

2) for each $1 \le i \le |G|$, with probability at most $e^{-\Omega(k)}$, $G'[i] \ne G[i]$, and

3) with probability at most $\frac{1}{2^x}$, the algorithm Recover-Motif does not stop in
$(O(k(n^2 (\log n)^{O(1)} + h^2)), O(k))$ time,

where $n$ is the longest length of any input sequence and $h = \min(|G|, n^{2/5})$.



Corollary 7. There exists a deterministic algorithm such that there are positive constants $c_0$, $c_1$,
and $\alpha$ that if the alphabet size is at least 4, the number of sequences is at least $c_1 \log n$, the motif
length is at least $c_0 \log n$, and each character in the motif region has probability at most $\alpha$ of mutation,
then the motif can be recovered in $(O(n^2 (\log n)^{O(1)}), O(\log n))$ time.

4. Algorithm Recover-Motif 

In this section, we give a unified approach to describe the three algorithms. The performance of the
algorithms is stated in Theorems 2, 4, and 6. The description of Algorithm Recover-Motif is
given in Section 4.2. The analysis of the algorithm is given in Section 5.

4.1. Some Parameters 
Definition 8. 

i. Constant $x$ is selected to be 10. This parameter controls the failure probability of our
algorithms to be at most $\frac{1}{2^x}$.

ii. The size of alphabet is t that is at least 4. 

iii. Select a constant pa G (0, 1) to have inequality (1) 

Po < -2p (1) 



iv. The constant e £ (0, 1) is selected to satisfy 

e<min ((^l_( 2/90 + 26)), 1(1 - JL - 1), I). (2) 

The existence of $\epsilon$ follows from inequality (1). The constant $\epsilon$ is used to control the mutation
in the motif area. It is a part of the parameter $\beta$ defined in item (xiv) of this definition.

v. Let $c = e^{-\epsilon^2/3}$. The constant $c$ is used to simplify probabilistic bounds which are derived from
the applications of Chernoff bounds (see Corollary 17).
vi. Define r{y) = (-^ + _^_). 
vii. Define u\ to be a large constant that for all v > 0, 

2(v + ui)c"+ ui < _J_ 



(1 - c) 2 ~ 5 • 2 a 



viii. Select constant pi G (0, 1) such that 

^T + ^ + ^ + PKL (4) 

The existence of pi follows from e < ^(1 — — j — ^-), which is implied by inequality (2). 

ix. Select constant p2 £ (0, 1) and constant positive integer v large enough such that 

6(v + ui)c v 

: h 92 < Pi, and (5) 

1 — c 

(J ? + ( „ + Ul) _fL + _^_ + _i_)< 1/2 . (6 ) 



x. Define % = ±, and <p(v) = (v + Ui)(j^ + ^) 
xi. Select constant «o such that 



4(t> — l)ao + «o < P2; an d (7) 

a < pq. (8) 



Adding inequalities (4), (5), and (7), we have inequality (9) 



,2 4 N 6(v + u 1 )c v , , 

t — 1 2 X 1 — c 

By arranging the terms in inequality (9) and the definitions of r(v) and ip(v), we have in- 
equality (10) 



2((2(v - l)a + t^— ) + r(v) + 2(q> + <p(v)) + 2e) + (a + e) < 1. (10) 

1 — c 



xii. The maximal mutation rate $\alpha$ for the second algorithm (Theorem 4) and the third algorithm
(Theorem 6) is selected as $\alpha_0$. Since the mutation rate of our sublinear time algorithm is
bounded by $\frac{1}{(\log n)^{2+\mu}}$, the maximal mutation rate $\alpha$ for the first algorithm (Theorem 2) is less
than $\alpha_0$ when $n$ is large enough. We always assume that all mutation rates $\alpha$ in our three
algorithms are in the range $(0, \alpha_0]$.

xiii. Define q(y) = 2(v — \)a + y^. By inequality (10), the definition of q(y), and the fact 
a £ (0, Qo), we have 

2(q(v) + r(v) + 2( ?0 + <p(v)) + 2e) + (q + e) < 1. (11) 

Inequality (11) implies q(v) < ^. By inequality (6), we have that 

(1 + (t; + ui)-^- + -f- + -1—) + q(v) < 3/4 (12) 

2 X 1 — c 1 — c 5 • 2 X 

xiv. Let $\beta = 2\alpha + 2\epsilon$. The parameter $\beta$ controls the similarity of $\aleph(S)$ and the original motif $G$
(see Lemma 26).

xv. Define $R = r(v)$.

xvi. We define $Q_0$ as follows.

$Q_0 = q(v)$. (13)

The parameter $Q_0$ used in Lemma 26 gives an upper bound on the probability that, for a
$\Theta(n, G, \alpha)$-sequence $S$, the motif region $\aleph(S)$ is not similar enough to the original motif $G$
according to the conditions in Lemma 26.



xvii. Select constant do such that 



5-2 

xviii. Select constant d\ such that (v + ui)c dllogn < g^. 
xix. Select number ui such that 



n 3 c <iologn < _L_ ( M ) 



x 



C V+U'Z -^ 

(di\ogn)(v + ui)- < ——-.and (15) 

1 — c 5 • 2 X 

c v+u 2 1 



Since only $n$ is variable, we can make $u_2 = O(\log \log n)$.

ln± 

xx. For a fixed c G (0, 1), define S c = —^ L . 






4.2. Description of Algorithm Recover-Motif 

The algorithm is described in this section. Before presenting the algorithm, we define some notions. 

Definition 9. 

• Two sequences $X_1$ and $X_2$ are weak left matched if (1) both $|X_1|$ and $|X_2|$ are at least $d_0 \log n$, and
(2) $\mathrm{diff}(X_1[1, i], X_2[1, i]) \le \beta$ for all integers $i$ with $v \le i \le d_0 \log n$.

• Two sequences $X_1$ and $X_2$ are left matched if (1) $d_0 \log n \le |X_1|, |X_2|$, (2) $X_1[i] = X_2[i]$ for
$i = 1, \ldots, v - 1$, and (3) $\mathrm{diff}(X_1[1, i], X_2[1, i]) \le \beta$ for all integers $i$ with $v \le i \le d_0 \log n$.

• Two sequences $X_1$ and $X_2$ are weak right matched if $X_1^R$ and $X_2^R$ are weak left matched,
where $X^R = a_n \cdots a_1$ is the inverse sequence of $X = a_1 \cdots a_n$.

• Two sequences $X_1$ and $X_2$ are right matched if $X_1^R$ and $X_2^R$ are left matched, where $X^R =
a_n \cdots a_1$ is the inverse sequence of $X = a_1 \cdots a_n$.

• Two sequences $X_1$ and $X_2$ are matched if $X_1$ and $X_2$ are both left and right matched.
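
The matching notions of Definition 9 can be sketched as simple predicates. In this hedged Python illustration, the parameters `v`, `beta`, and `window` stand for $v$, $\beta$, and $d_0 \log n$; they are passed in explicitly because the paper fixes them through Definition 8, and the function names are chosen here for readability.

```python
def weak_left_matched(x1, x2, v, beta, window):
    """Every prefix of length i, v <= i <= window, has relative distance <= beta."""
    if len(x1) < window or len(x2) < window:
        return False
    mismatches = 0
    for i in range(1, window + 1):
        if x1[i - 1] != x2[i - 1]:
            mismatches += 1
        # mismatches / i is diff(x1[1, i], x2[1, i]) in the paper's notation.
        if i >= v and mismatches / i > beta:
            return False
    return True


def left_matched(x1, x2, v, beta, window):
    """Weak left matched, and additionally identical on the first v - 1 characters."""
    if len(x1) < window or len(x2) < window:
        return False
    if x1[: v - 1] != x2[: v - 1]:
        return False
    return weak_left_matched(x1, x2, v, beta, window)


def right_matched(x1, x2, v, beta, window):
    """Left matched after reversing both sequences."""
    return left_matched(x1[::-1], x2[::-1], v, beta, window)


def matched(x1, x2, v, beta, window):
    return left_matched(x1, x2, v, beta, window) and right_matched(x1, x2, v, beta, window)


# Two 12-character strings that differ only at the third position.
print(left_matched("TTATTAACGATT", "TTTTTAACGATT", v=3, beta=0.4, window=12))
```

With `v=3` the single mismatch falls on the first checked prefix, whose relative distance is 1/3, so the pair is accepted for `beta=0.4` and rejected for `beta=0.3`.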

Variable L will be controlled in the range L G [(logn) 3+<El , n^~ e2 \ in our algorithm with high 
probability. We define the following functions that depend on L. 



Definition 10. Define M(L) = ^ 3 ^- +x VZlogn. Define Mi(L) = <5c ^ L) (see Definition 8 for 
S c ), where c = \. 

We would like to minimize the function (JM + L 2 )logn. This selection can make the total 
time complexity sublinear. 

Definition 11. For a $\Theta_\alpha(n, G)$-sequence $S$, define $\mathrm{LB}(S)$ to be the left boundary $l$ of the motif
region $\aleph(S)$ in $S$, and $\mathrm{RB}(S)$ to be the right boundary $r$ of the motif region $\aleph(S)$ in $S$ such that
$\aleph(S) = S[l, r]$.

4.2.1. Boundary-Phase of Algorithm Recover-Motif 

The first phase of Algorithm Recover-Motif finds the rough motif boundaries of all input sequences.
It first detects the rough motif boundaries of one sequence by comparing two input sequences. Then
the rough boundaries of the first sequence are used to find the rough motif boundaries of the other input
sequences.

The three algorithms share most of their functions, so we describe them in a unified way. A
special variable "algorithm-type" selects one of the three algorithms.

Definition 12. Let algorithm-type represent one of the three algorithm types, "RANDOMIZED-SUBLINEAR",
"RANDOMIZED-SUBQUADRATIC", and "DETERMINISTIC-SUPERQUADRATIC".

Definition 13. Assume that $A_1$ is a set of positions in a $\Theta_\alpha(n, G)$-sequence $S_1$ and $A_2$ is a set of
positions in a $\Theta_\alpha(n, G)$-sequence $S_2$. If there are positions $a_1 \in A_1$ and $a_2 \in A_2$ such that for some
position $j$ with $1 \le j \le |G|$, $a_1$ is the position of $\aleph(S_1)[j]$ in $S_1$ and $a_2$ is the position of $\aleph(S_2)[j]$ in
$S_2$, then $A_1$ and $A_2$ have a collision at $(a_1, a_2)$.
In the following function Collision-Detection, the parameter $\omega_{\text{algorithm-type}} \le \beta$ is defined as follows
for the three algorithms.



UJ 



algorithm-type 



if algorithm-type=RANDOMIZED-SUBLINEAR; 

/3 if algorithm-type=RANDOMIZED-SUBQUADRATIC; (17) 

j3 if algorithm- type=DETERMINISTIC-SUPERQUADRATIC. 



Collision-Detection($S_1$, $U_1$, $S_2$, $U_2$)

Input: a pair of $\Theta(n, G, \alpha)$-sequences $S_1$ and $S_2$; $U_i$ is a set of locations in $S_i$ for $i = 1, 2$.

Output: the left and right rough boundaries of the two sequences.

Let $D_1$ be all subsequences $S_1[a, a + d_0 \log n - 1]$ of $S_1$ of length $d_0 \log n$ with $a \in U_1$.

Let $D_2$ be all subsequences $S_2[b, b + d_0 \log n - 1]$ of $S_2$ of length $d_0 \log n$ with $b \in U_2$.

Find two subsequences $X_1 = S_1[a_1, a_1 + d_0 \log n - 1] \in D_1$ and
$X_2 = S_2[b_1, b_1 + d_0 \log n - 1] \in D_2$ such that $a_1$ is the least and
$\mathrm{diff}(X_1, X_2) \le \omega_{\text{algorithm-type}}$.

Find two subsequences $X'_1 = S_1[a'_1, a'_1 + d_0 \log n - 1] \in D_1$ and
$X'_2 = S_2[b'_1, b'_1 + d_0 \log n - 1] \in D_2$ such that $a'_1$ is the largest and
$\mathrm{diff}(X'_1, X'_2) \le \omega_{\text{algorithm-type}}$.

Find two subsequences $Y_1 = S_1[f_1, f_1 + d_0 \log n - 1] \in D_1$ and
$Y_2 = S_2[e_1, e_1 + d_0 \log n - 1] \in D_2$ such that $e_1$ is the least and
$\mathrm{diff}(Y_1, Y_2) \le \omega_{\text{algorithm-type}}$.

Find two subsequences $Y'_1 = S_1[f'_1, f'_1 + d_0 \log n - 1] \in D_1$ and
$Y'_2 = S_2[e'_1, e'_1 + d_0 \log n - 1] \in D_2$ such that $e'_1$ is the largest and
$\mathrm{diff}(Y'_1, Y'_2) \le \omega_{\text{algorithm-type}}$.

Return $(a_1, a'_1, e_1, e'_1)$.
End of Collision-Detection
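
A direct, unoptimized reading of Collision-Detection can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: positions are 0-indexed, `window` stands for $d_0 \log n$, `omega` for $\omega_{\text{algorithm-type}}$, all sampled window pairs are compared by brute force, and `None` is returned when no collision is found.

```python
def collision_detection(s1, u1, s2, u2, window, omega):
    """Return (least a, largest a, least e, largest e) over all colliding windows."""

    def diff(x, y):
        return sum(1 for p, q in zip(x, y) if p != q) / len(x)

    # Keep only sampled start positions whose window fits inside the sequence.
    starts1 = [a for a in sorted(u1) if 0 <= a and a + window <= len(s1)]
    starts2 = [b for b in sorted(u2) if 0 <= b and b + window <= len(s2)]

    # A collision is a pair of windows whose relative Hamming distance is small.
    hits = [
        (a, b)
        for a in starts1
        for b in starts2
        if diff(s1[a:a + window], s2[b:b + window]) <= omega
    ]
    if not hits:
        return None
    in_s1 = [a for a, _ in hits]
    in_s2 = [b for _, b in hits]
    return min(in_s1), max(in_s1), min(in_s2), max(in_s2)


# Toy input: the shared block GATTACAT sits at index 6 in s1 and index 3 in s2.
s1 = "CCCCCC" + "GATTACAT" + "CCCC"
s2 = "AAA" + "GATTACAT" + "GGGGG"
print(collision_detection(s1, range(len(s1)), s2, range(len(s2)), window=5, omega=0))
```

On this toy input the only colliding windows lie inside the shared block, so the least and largest starts in each sequence bracket the block from the left.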

Function Point-Selection($S$, $L$, $I$) is defined differently in the three algorithms. It
selects some positions from each interval of length $L$ in the given sequence.

Point-Selection($S$, $L$, $I$)

Input: a $\Theta(n, G, \alpha)$-sequence $S$, a size parameter $L$ of partition, and an interval of
positions $I$ in $S$.

Output: a set $U$ of positions from $S$.
Steps:
Let $U = \emptyset$.

If algorithm-type=RANDOMIZED-SUBLINEAR or RANDOMIZED-SUBQUADRATIC
If(L>fep) 

        For each interval $I'$ in $I$, partition $I'$ into intervals of size $L$.
        Sample $M(L)$ random positions at every
        interval of size $L$ derived in the above partition, and put them into $U$.
    Else
        Put every position of $I$ into $U$.
If algorithm-type=DETERMINISTIC-SUPERQUADRATIC
    Put every position of $I$ into $U$.
Return $U$.
End of Point-Selection
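
The sampling logic of Point-Selection can be sketched as follows. Several details are assumptions made for illustration: the interval is an inclusive pair `(lo, hi)`, `samples_per_block` plays the role of $M(L)$, `threshold` stands in for the cutoff on $L$ in the pseudocode, and positions are drawn without replacement inside each block.

```python
import random


def point_selection(interval, L, samples_per_block, algorithm_type, threshold, rng=None):
    """Select positions from an inclusive interval (lo, hi) of a sequence."""
    lo, hi = interval
    randomized = algorithm_type in ("RANDOMIZED-SUBLINEAR", "RANDOMIZED-SUBQUADRATIC")
    # The deterministic variant, and small L, take every position.
    if not randomized or L <= threshold:
        return set(range(lo, hi + 1))
    rng = rng or random.Random()
    chosen = set()
    # Partition the interval into blocks of size L and sample inside each block.
    for start in range(lo, hi + 1, L):
        block = range(start, min(start + L, hi + 1))
        count = min(samples_per_block, len(block))
        chosen.update(rng.sample(block, count))
    return chosen


points = point_selection((1, 10), 5, 2, "RANDOMIZED-SUBLINEAR", threshold=1,
                         rng=random.Random(0))
print(sorted(points))
```

Passing a seeded `random.Random` makes a run reproducible, which is convenient when testing the later collision step.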

Improve-Boundaries($S_1$, $a_l$, $a_r$, $S_2$, $f_l$, $f_r$, $L$)

Input: a $\Theta(n, G, \alpha)$-sequence $S_1$ with rough left and right boundaries $a_l$ and $a_r$, a $\Theta(n, G, \alpha)$-sequence
$S_2$ with rough left and right boundaries $f_l$ and $f_r$, and the rough distance $L$ to the nearest
motif boundary from those rough boundaries.

Output: improved rough left and right boundaries for both $S_1$ and $S_2$.
Steps:

Find two subsequences $X_1 = S_1[a_1, a_1 + d_0 \log n - 1]$ and $X_2 = S_2[b_2, b_2 + d_0 \log n - 1]$
with $a_1 \in [a_l - L, a_l + L]$ and $b_2 \in [f_l - L, f_l + L]$ such that $\mathrm{diff}(X_1, X_2) \le \beta$ and $a_1$ is
the least.

Find two subsequences $X'_1 = S_1[a'_1, a'_1 + d_0 \log n - 1]$ and $X'_2 = S_2[b'_2, b'_2 + d_0 \log n - 1]$
with $a'_1 \in [a_r - L, a_r + L]$ and $b'_2 \in [f_r - L, f_r + L]$ such that $\mathrm{diff}(X'_1, X'_2) \le \beta$ and $a'_1$ is
the largest.

Find two subsequences $Y_1 = S_1[e_1, e_1 + d_0 \log n - 1]$ and $Y_2 = S_2[f_2, f_2 + d_0 \log n - 1]$
with $e_1 \in [a_l - L, a_l + L]$ and $f_2 \in [f_l - L, f_l + L]$ such that $\mathrm{diff}(Y_1, Y_2) \le \beta$ and $f_2$ is
the least.

Find two subsequences $Y'_1 = S_1[e'_1, e'_1 + d_0 \log n - 1]$ and $Y'_2 = S_2[f'_2, f'_2 + d_0 \log n - 1]$
with $e'_1 \in [a_r - L, a_r + L]$ and $f'_2 \in [f_r - L, f_r + L]$ such that $\mathrm{diff}(Y'_1, Y'_2) \le \beta$ and $f'_2$ is
the largest.

Return $(a_1, a'_1, f_2, f'_2)$.
End of Improve-Boundaries

Initial-Boundaries($S_1$, $S_2$)

Input: a pair of $\Theta(n, G, \alpha)$-sequences $S_1$ and $S_2$.

Output: rough left boundary roughLeft$_{S_1}$ of $S_1$, rough right boundary roughRight$_{S_1}$ of $S_1$, rough left
boundary roughLeft$_{S_2}$ of $S_2$, and rough right boundary roughRight$_{S_2}$ of $S_2$.
Steps:

Let $U_1 = U_2 = \emptyset$.
Let $L = n^{2/5}$.
Repeat
    Let $U_1$ = Point-Selection($S_1$, $L$, $[1, |S_1|]$).
    Let $U_2$ = Point-Selection($S_2$, $L$, $[1, |S_2|]$).
    Let $(L_{S_1}, R_{S_1}, L_{S_2}, R_{S_2})$ = Collision-Detection($S_1$, $U_1$, $S_2$, $U_2$).
    If ($L_{S_1} \ne \emptyset$ and $R_{S_1} \ne \emptyset$)
    Then Goto H.
    Else $L = L/2$.
Until (L < i^p) 

H: Return Improve-Boundaries($S_1$, $L_{S_1}$, $R_{S_1}$, $S_2$, $L_{S_2}$, $R_{S_2}$, $2L$).
End of Initial-Boundaries 
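
The halving loop of Initial-Boundaries can be sketched as below. The three subroutines are passed in as callables because their bodies are defined elsewhere in the paper. The stopping threshold `min_L` is a hypothetical parameter, and returning `None` when no collision is found is a simplification made for this sketch.

```python
def initial_boundaries(s1, s2, point_selection, collision_detection,
                       improve_boundaries, min_L):
    """Sketch of the Repeat/Until loop: start with L = n^(2/5) and halve L
    until Collision-Detection reports rough boundaries for s1."""
    n = max(len(s1), len(s2))
    L = n ** (2 / 5)
    while True:
        u1 = point_selection(s1, L, (1, len(s1)))
        u2 = point_selection(s2, L, (1, len(s2)))
        l1, r1, l2, r2 = collision_detection(s1, u1, s2, u2)
        if l1 is not None and r1 is not None:
            # Label H in the pseudocode: refine with distance parameter 2L.
            return improve_boundaries(s1, l1, r1, s2, l2, r2, 2 * L)
        L /= 2
        if L < min_L:  # hypothetical threshold
            return None  # simplification: report failure explicitly
```

Since $L$ halves in every round, the loop runs $O(\log n)$ times for any positive `min_L`.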

Motif-Length-And-Boundaries($Z_1$)

Input: $Z_1 = \{S'_1, \cdots, S'_{2k_1}\}$ is a set of independent $\Theta(n, G, \alpha)$ sequences.

Steps:

For $i = 1$ to $k_1$
  let $(\mathrm{roughLeft}_{S'_{2i-1}}, \mathrm{roughRight}_{S'_{2i-1}})$ = Initial-Boundaries($S'_{2i-1}$, $S'_{2i}$).

Let $L_1$ be the median of $\bigcup_{i=1}^{k_1}\{(\mathrm{roughRight}_{S'_{2i-1}} - \mathrm{roughLeft}_{S'_{2i-1}})\}$.

Return $L_1$.

End of Motif-Length-And-Boundaries
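
The median step above is what makes the length estimate robust: a minority of pairs with badly wrong boundaries cannot move the median far. A minimal sketch, assuming the rough boundaries are already available as `(left, right)` integer pairs (the function name and the choice of `median_low` are illustrative):

```python
from statistics import median_low


def estimate_motif_length(rough_boundaries):
    """Median of (roughRight - roughLeft) over the sampled sequences.

    median_low is used so the result is always one of the observed
    differences, even when the number of pairs is even.
    """
    lengths = [right - left for left, right in rough_boundaries]
    return median_low(lengths)
```

For example, one outlier pair such as `(0, 95)` among pairs whose differences are 19 or 20 does not change the estimate.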
4.2.2. Extract-Phase of Algorithm Recover-Motif

After a set of motif candidates W is produced from Boundary-Phase of algorithm Recover-Motif, 
we use this set to match with another set of input sequences to recover the hidden motif by voting. 

Match($G_l$, $G_r$, $S''_i$)

Input: a motif left part $G_l$ (which can be derived from the rough left boundary of an input sequence $S$), a motif right part $G_r$, a sequence $S''_i$ from the group $Z_2$, with known rough left and right boundaries.

Output: either a rough motif region of $S''_i$, or an empty sequence which means the failure in extracting the motif region $\aleph(S''_i)$ of $S''_i$.
Steps:

Find a position $a$ in $S''_i$ with $\mathrm{roughLeft}_{S''_i} \le a \le \mathrm{roughLeft}_{S''_i} + (v + u_2)$ such that $G_l$ and $S''_i[a, a + |G_l| - 1]$ are left matched (see Definition 9).

Find a position $b$ in $S''_i$ with $\mathrm{roughRight}_{S''_i} - (v + u_2) \le b \le \mathrm{roughRight}_{S''_i}$ such that $G_r$ and $S''_i[b - |G_r| + 1, b]$ are right matched (see Definition 9).

If both $a$ and $b$ are found
Then output $S''_i[a, b]$
Else output $\emptyset$ (empty string).
End of Match
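
A rough Python rendering of Match is given below. The "left matched" and "right matched" predicates come from Definition 9, which is not reproduced in this section, so the sketch takes a single predicate `matched` as a parameter and defaults it to exact equality purely for illustration. Indices are 0-based, `slack` stands in for $v + u_2$, and taking the first matching position is an arbitrary choice of this sketch.

```python
def match(g_l, g_r, s, rough_left, rough_right, slack,
          matched=lambda x, y: x == y):
    """Return s[a..b] (inclusive) if both ends can be located, else ""."""
    # Scan to the right of the rough left boundary for the left part.
    a = next((p for p in range(rough_left, rough_left + slack + 1)
              if 0 <= p and p + len(g_l) <= len(s)
              and matched(g_l, s[p:p + len(g_l)])), None)
    # Scan to the left of the rough right boundary for the right part.
    b = next((q for q in range(rough_right - slack, rough_right + 1)
              if q - len(g_r) + 1 >= 0 and q < len(s)
              and matched(g_r, s[q - len(g_r) + 1:q + 1])), None)
    if a is None or b is None:
        return ""
    return s[a:b + 1]
```

Each call inspects at most `slack + 1` candidate positions per side.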

Extract($G_l$, $G_r$, $Z_2$):

Input: $Z_2 = \{S''_1, S''_2, \cdots, S''_{k_2}\}$ and their rough left boundaries and rough right boundaries.

Steps:

For each $S''_i$ with $i = 1, 2, \cdots, k_2$,
  let $G''_i$ = Match($G_l$, $G_r$, $S''_i$).

Return $(G''_1, G''_2, \cdots, G''_{k_2})$.
End of Extract

The following is Extract-Phase of algorithm Recover-Motif. It extracts the motif regions of another set $Z_2$ of input sequences.

Extract-Phase($S'$, $Z_2$):

Input: $S'$ is an input sequence with known $\mathrm{roughLeft}_{S'}$ and $\mathrm{roughRight}_{S'}$ for its rough left and right boundaries respectively, and $Z_2 = \{S''_1, \cdots, S''_{k_2}\}$ is a set of input sequences.
Steps:

For each subsequence $G_l = S'[a, a + d_0 \log n - 1]$ with $a \in [\mathrm{roughLeft}_{S'}, \mathrm{roughLeft}_{S'} + (v + u_1)]$ and $G_r = S'[b - d_0 \log n + 1, b]$ with $b \in [\mathrm{roughRight}_{S'} - (v + u_1), \mathrm{roughRight}_{S'}]$
  let $(G''_1, G''_2, \cdots, G''_{k_2})$ be the output from Extract($G_l$, $G_r$, $Z_2$).
  If the number of empty sequences in $G''_1, \cdots, G''_{k_2}$ is at most $(Q_0 + (R + 2\epsilon))k_2$
  Then return $(G''_1, G''_2, \cdots, G''_{k_2})$.
Return $\emptyset$ (empty set).
End of Extract-Phase

4.2.3. Voting-Phase

The function Vote($G''_1, G''_2, \cdots, G''_{k_2}$) generates another sequence $G'$ by voting, where $G'[i]$ is the most frequent character among $G''_1[i], G''_2[i], \cdots, G''_{k_2}[i]$.

Voting-Phase($G''_1, G''_2, \cdots, G''_{k_2}$)

Input: $\Theta(n, G, \alpha)$ sequences $G''_1, G''_2, \cdots, G''_{k_2}$ of the same length $m$.
Output: a sequence $G'$, which is derived by voting on every position of the input sequences.
Steps:

For each $j = 1, \cdots, m$
  let $a_j$ be the most frequent character among $G''_1[j], \cdots, G''_{k_2}[j]$.

Return $G' = a_1 \cdots a_m$.
End of Voting-Phase
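
The voting step is a column-wise plurality vote. A minimal sketch, assuming all candidates have the same length as the input specification requires; how ties are broken is not stated in the text, so the sketch simply keeps whatever `Counter.most_common` returns first:

```python
from collections import Counter


def vote(candidates):
    """Return the sequence whose j-th character is the most frequent
    character in column j of the equal-length candidate sequences."""
    if not candidates:
        raise ValueError("need at least one candidate sequence")
    m = len(candidates[0])
    if any(len(g) != m for g in candidates):
        raise ValueError("all candidates must have the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*candidates))
```

Recovering one character costs time proportional to the number of candidates.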

4.2.4. Entire Algorithm Recover-Motif

The entire algorithm is described below. We maintain the size of $Z_1$ and $Z_2$ to be roughly equal, which implies

$|Z_1| = \Theta(|Z_2|)$ (18)

Algorithm Recover-Motif($Z$)

Input: $Z = Z_1 \cup Z_2$, where $Z_1 = \{S'_1, \cdots, S'_{2k_1}\}$ and $Z_2 = \{S''_1, \cdots, S''_{k_2}\}$ are two sets of input sequences.
Steps:
Preprocessing Part:

For each $S \in Z_1 \cup Z_2$, let $\mathrm{roughLeft}_S = \mathrm{roughRight}_S = \emptyset$ (the two boundaries are unknown).
$l_{motif}$ = Motif-Length-And-Boundaries($Z_1$).

Let L = lmotif/^- 

For $i = 1$ to $k_1$,
  let $U_{S'_{2i-1}}$ = Point-Selection($S'_{2i-1}$, $L$, $[\mathrm{roughLeft}_{S'_{2i-1}} - 2L, \mathrm{roughLeft}_{S'_{2i-1}} + 2L]$) $\cup$ Point-Selection($S'_{2i-1}$, $L$, $[\mathrm{roughRight}_{S'_{2i-1}} - 2L, \mathrm{roughRight}_{S'_{2i-1}} + 2L]$).
For $j = 1$ to $k_2$
  let $U_{S''_j}$ = Point-Selection($S''_j$, $L$, $[1, |S''_j|]$).
For $i = 1$ to $k_1$
  For each $S''_j \in Z_2$
    Let $(L_{S'_{2i-1}}, R_{S'_{2i-1}}, L_{S''_j}, R_{S''_j})$ = Collision-Detection($S'_{2i-1}$, $U_{S'_{2i-1}}$, $S''_j$, $U_{S''_j}$).
    Let $(L_{S'_{2i-1}}, R_{S'_{2i-1}}, \mathrm{roughLeft}_{S''_j}, \mathrm{roughRight}_{S''_j})$ = Improve-Boundaries($S'_{2i-1}$, $L_{S'_{2i-1}}$, $R_{S'_{2i-1}}$, $S''_j$, $L_{S''_j}$, $R_{S''_j}$, $2L$).
  Let $(G''_1, G''_2, \cdots, G''_{k_2})$ be the output from Extract-Phase($S'_{2i-1}$, $Z_2$).
  If $(G''_1, \cdots, G''_{k_2})$ is not empty
  Then go to Voting Part.
Voting Part:

Return Voting-Phase($G''_1, G''_2, \cdots, G''_{k_2}$).
End of Algorithm Recover-Motif

5. Analysis of Algorithm

The correctness of the algorithm will be proved via a series of lemmas in Sections 5.2 and 5.3. Section 5.2 is for Boundary-Phase and Section 5.3 is for Extract-Phase. Furthermore, Section 5.4 gives some lemmas for the two randomized algorithms and Section 5.5 gives the proof for the deterministic algorithm.

5.1. Review of Some Classical Results in Probability

Some well-known results in classical probability theory are listed here. Readers can skip this section if they are familiar with them. The inclusion of these results makes the paper self-contained.

• For a list of events $A_1, \cdots, A_m$, $\Pr[A_1 \cup A_2 \cup \cdots \cup A_m] \le \Pr[A_1] + \Pr[A_2] + \cdots + \Pr[A_m]$.

• For two independent events $A$ and $B$, $\Pr[A \cap B] = \Pr[A]\Pr[B]$.

• For a nonnegative random variable $Y$, $\Pr[Y \ge t] \le \frac{E[Y]}{t}$ for every positive real number $t$. This is called the Markov inequality.

The analysis of our algorithm employs the Chernoff bound [17] and Corollary 17 below, which 
can be derived from it (see [14]). 

Theorem 14 ([17]). Let $X_1, \cdots, X_n$ be $n$ independent random 0-1 variables, where $X_i$ takes 1 with probability $p_i$. Let $X = \sum_{i=1}^{n} X_i$, and $\mu = E[X]$. Then for any $\delta > 0$,

i. $\Pr(X < (1 - \delta)\mu) < e^{-\frac{1}{2}\mu\delta^2}$, and

ii. $\Pr(X > (1 + \delta)\mu) < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{\mu}$.

We follow the proof of Theorem 14 to make the following version of Chernoff bound so that it can be used in our algorithm analysis.

Theorem 15. Let $X_1, \cdots, X_n$ be $n$ independent random 0-1 variables, where $X_i$ takes 1 with probability at most $p$. Let $X = \sum_{i=1}^{n} X_i$. Then for any $\delta > 0$, $\Pr(X > (1 + \delta)pn) < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{pn}$.

Proof: Let $y$ be an arbitrary positive real number. By the definition of expectation, we have $E(e^{yX_i}) = \Pr(X_i = 1)e^{y} + \Pr(X_i = 0)$. Since the function $f(x) = xe^{y} + (1 - x)$ is increasing for all $y > 0$ and $\Pr(X_i = 1) \le p$, we have $E(e^{yX_i}) \le pe^{y} + (1 - p)$. We have the following inequalities:

$$
\begin{aligned}
\Pr(X > (1+\delta)pn) &< \frac{E(e^{yX})}{e^{y(1+\delta)pn}} && (19)\\
&\le \frac{\prod_{i=1}^{n} E(e^{yX_i})}{e^{y(1+\delta)pn}} && (20)\\
&\le \frac{\prod_{i=1}^{n} (pe^{y} + 1 - p)}{e^{y(1+\delta)pn}} && (21)\\
&= \frac{\prod_{i=1}^{n} (1 + p(e^{y} - 1))}{e^{y(1+\delta)pn}} && (22)\\
&\le \frac{\prod_{i=1}^{n} e^{p(e^{y} - 1)}}{e^{y(1+\delta)pn}} && (23)\\
&= \frac{e^{(e^{y} - 1)pn}}{e^{y(1+\delta)pn}} && (24)\\
&= \left(\frac{e^{(e^{y} - 1)}}{e^{y(1+\delta)}}\right)^{pn}. && (25)
\end{aligned}
$$

The inequality (19) is based on the Markov inequality. The transition from (19) to (20) is due to the independence of the variables $X_1, \cdots, X_n$, the transition from (20) to (21) uses $E(e^{yX_i}) \le pe^{y} + (1 - p)$, and the transition from (22) to (23) uses $1 + x \le e^{x}$.

Since $\frac{e^{(e^{y} - 1)}}{e^{y(1+\delta)}}$ is minimal at $y = \ln(1 + \delta)$, we have $\Pr(X > (1 + \delta)pn) < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{pn}$. ∎

Define $g(\delta) = \frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}$. We note that $g(\delta)$ is always strictly less than 1 for all $\delta > 0$, and $g(\delta)$ is fixed if $\delta$ is a constant. This can be verified by checking that the function $f(x) = \ln\frac{e^{x}}{(1+x)^{(1+x)}} = x - (1 + x)\ln(1 + x)$ is decreasing and $f(0) = 0$. This is because $f'(x) = -\ln(1 + x)$, which is less than 0 for all $x > 0$.
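
The bound of Theorem 15 can be compared numerically with the exact tail of a binomial distribution, which is the special case where every $X_i$ takes 1 with probability exactly $p$. The sketch below is only a sanity check; the function names are illustrative and not part of the paper.

```python
from math import comb, exp


def chernoff_upper(p, n, delta):
    """Right-hand side of Theorem 15: g(delta)^(p*n)."""
    return (exp(delta) / (1 + delta) ** (1 + delta)) ** (p * n)


def binomial_upper_tail(p, n, threshold):
    """Exact Pr(X > threshold) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if k > threshold)
```

For instance, with `p = 0.1`, `n = 100` and `delta = 1.0` one can compare `binomial_upper_tail(p, n, (1 + delta) * p * n)` with `chernoff_upper(p, n, delta)`. The exact tail never exceeds the bound, and the bound itself is below 1 because $g(\delta) < 1$.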



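Theorem 16. Let $X_1, \cdots, X_n$ be $n$ independent random 0-1 variables, where $X_i$ takes 1 with probability at least $p$. Let $X = \sum_{i=1}^{n} X_i$. Then for any $\delta > 0$, $\Pr(X < (1 - \delta)pn) < e^{-\frac{1}{2}\delta^2 pn}$.

Proof: $\Pr[X < (1-\delta)pn] = \Pr[-X > -(1-\delta)pn] = \Pr[e^{-yX} > e^{-y(1-\delta)pn}]$ for each positive real number $y$. Applying the Markov inequality, we have

$$
\begin{aligned}
\Pr[X < (1 - \delta)pn] &\le \frac{\prod_{i=1}^{n} E(e^{-yX_i})}{e^{-y(1-\delta)np}} && (26)\\
&\le \frac{e^{(e^{-y} - 1)np}}{e^{-y(1-\delta)np}} && (27)\\
&\le \left[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right]^{pn} && (28)\\
&\le e^{-\frac{1}{2}pn\delta^2}. && (29)
\end{aligned}
$$

The transition from (27) to (28) is to let $y = \ln\frac{1}{1-\delta}$. The transition from (28) to (29) follows from the fact $(1 - \delta)^{(1-\delta)} > e^{-\delta + \frac{\delta^2}{2}}$.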



Corollary 17 ([14]). Let $X_1, \cdots, X_n$ be $n$ independent random 0-1 variables and $X = \sum_{i=1}^{n} X_i$.
i. If $X_i$ takes 1 with probability at most $p$, then for any $1 > \epsilon > 0$, $\Pr(X > pn + \epsilon n) < e^{-\frac{1}{3}n\epsilon^2}$.
ii. If $X_i$ takes 1 with probability at least $p$, then for any $\epsilon > 0$, $\Pr(X < pn - \epsilon n) < e^{-\frac{1}{2}n\epsilon^2}$.



'1) follows from Theorem 14. 



2 - 



Proof: For X = £™ =1 , // = E(X) = Ya=i E ( x i) = P n - Let S 

By Taylor theorem, ln(l+e) > e-^. We have that (l + i)ln(l+e) > (l + i)(e- 

|. Thus, -, — i— <e s. Since pn + en = (l + 5)u and the function (l + y)y is increasing for y > 0, 

6 (1+ £ )( 1 +t) v ' \ at ., 

Pr(X > pn + en) = Pr(X > (l + 8)n) < '—r- - '—^ 

Thus (ii) is proved. 



(i+f) 



(!+|) 



(1+f )d + f) 



(1+0 



l+|-4 r >l+ 



<e" 



(1+^) 



5.2. Analysis of Boundary-Phase of Algorithm Recover-Motif

Lemma 18 shows that a sequence can match a random sequence only with small probability. It will be used to prove that when two substrings in two different $\Theta(n, G, \alpha)$-sequences are similar, they are likely to coincide with the motif regions in the two $\Theta(n, G, \alpha)$-sequences, respectively.

Lemma 18. Assume that $X_1$ and $X_2$ are two independent sequences of the same length and that every character of $X_2$ is a random character from $\Sigma$. Then

i. if $1 \le |X_1| = |X_2| < v$, then the probability that $X_1$ and $X_2$ are matched is at most $\frac{1}{t^{|X_1|}}$ (where $t = \|\Sigma\|$); and

ii. the probability for $\mathrm{diff}(X_1, X_2) \le \beta$ is at most $e^{-\frac{\epsilon^2 |X_1|}{3}}$.

Proof: The two statements are proved as follows.

Statement i: For every character $X_2[j]$ with $1 \le j < v$, the probability is $\frac{1}{t}$ that $X_2[j] = X_1[j]$.

Statement ii: For every character $X_2[j]$ with $1 \le j \le |X_2|$, the probability is $\frac{1}{t}$ for $X_2[j]$ to equal $X_1[j]$. If $\mathrm{diff}(X_1, X_2) \le \beta$, the two sequences $X_1$ and $X_2$ are identical at at least $(1 - \beta)|X_1|$ positions, but the expected number of positions where the two sequences are identical is $\frac{1}{t}|X_1|$. The probability for $\mathrm{diff}(X_1, X_2) \le \beta$ is at most $e^{-\frac{(1 - \beta - \frac{1}{t})^2 |X_1|}{3}} \le e^{-\frac{\epsilon^2 |X_1|}{3}}$ by Corollary 17, and Definitions 8 and 9. ∎



Lemma 19 shows that with small probability, an input $\Theta_\alpha(n, G)$ sequence contains a motif region that has many mutations.

Lemma 19. With probability at most $\frac{c^{y}}{1-c}$, a $\Theta_\alpha(n, G)$ sequence $S$ changes more than $\frac{\beta}{2}t$ characters among the first $t$ characters of its motif region $\aleph(S)$ for some $t$ with $y \le t \le |G|$, where $c = e^{-\frac{\epsilon^2}{3}}$.



Proof: Every character in the H(S') region has probability at most a to mutate. We know that 

1^(5)1 = |G| > d. By Corollary 17, with probability at most e~~3~*, a sequence S in Z\ has more 
than (q + e)t mutations (recall the setting for /3 at Definition 9) among the first left t characters. 

The total is Y,T= y e ~^' = T^- ■ 

Lemma 20 shows that Improve-Boundaries() has a good chance to improve the accuracy of rough motif boundaries.

Lemma 20. Assume that the $\Theta_\alpha(n, G)$ sequence $S_i$ has $L_{S_i} \in [\mathrm{LB}(S_i) - L, \mathrm{LB}(S_i) + L]$ and $R_{S_i} \in [\mathrm{RB}(S_i) - L, \mathrm{RB}(S_i) + L]$ for $i = 1, 2$. Then for $(\mathrm{roughLeft}_{S_1}, \mathrm{roughRight}_{S_1}, \mathrm{roughLeft}_{S_2}, \mathrm{roughRight}_{S_2})$ = Improve-Boundaries($S_1$, $L_{S_1}$, $R_{S_1}$, $S_2$, $L_{S_2}$, $R_{S_2}$, $L$), we have the following facts:

i. With probablity at most j-^;+ (-i_y2 h 5i2 x > roughLeftg. is not in [LB(S'j) — (v +u),LB(6 , j)] 

fori = 1,2. 

ii. With probablity at most j^ -\ — (1 _ c ^ h 5 . 2 1 r n , roughRight^. is not in [RB(5j),RB(5j) + 

(v + u)] for i = 1,2. 

iii. Improve-Boundaries($S_1$, $L_{S_1}$, $R_{S_1}$, $S_2$, $L_{S_2}$, $R_{S_2}$, $L$) runs in $O(L^2 \log n)$ time.
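Proof: We need the following inequality:

$$\sum_{i=j}^{\infty} i a^{i} \le \frac{j a^{j}}{(1-a)^2}. \quad (30)$$

Let $f(x) = \sum_{i=j}^{\infty} e^{\theta i x}$. Compute the derivative $f'(x) = \theta \sum_{i=j}^{\infty} i e^{\theta i x}$. We also have the closed form $f(x) = \frac{e^{\theta j x}}{1 - e^{\theta x}}$, which implies

$$
\begin{aligned}
f'(x) &= \frac{\theta j e^{\theta j x}(1 - e^{\theta x}) - e^{\theta j x}(-\theta e^{\theta x})}{(1 - e^{\theta x})^2} && (31)\\
&= \frac{\theta j e^{\theta j x} - \theta (j - 1) e^{\theta (j+1) x}}{(1 - e^{\theta x})^2}. && (32)
\end{aligned}
$$

Let $\theta = \ln a$ and $x = 1$. We have

$$\sum_{i=j}^{\infty} i a^{i} = \frac{j a^{j} - (j - 1) a^{j+1}}{(1-a)^2} \le \frac{j a^{j}}{(1-a)^2}.$$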
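
Inequality (30) and the closed form derived above can be checked numerically. The following sketch truncates the infinite sum after a large number of terms, which is harmless for $0 < a < 1$ because the terms decay geometrically; the helper names are illustrative.

```python
def tail_sum(a, j, terms=10_000):
    """Truncated value of sum_{i >= j} i * a^i for 0 < a < 1."""
    return sum(i * a ** i for i in range(j, j + terms))


def closed_form(a, j):
    """(j a^j - (j - 1) a^(j+1)) / (1 - a)^2, the exact value of the sum."""
    return (j * a ** j - (j - 1) * a ** (j + 1)) / (1 - a) ** 2


def upper_bound(a, j):
    """Right-hand side of inequality (30): j a^j / (1 - a)^2."""
    return j * a ** j / (1 - a) ** 2
```

With `a = 0.5` and `j = 3`, the truncated sum agrees with the closed form, and both stay below the bound of (30).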
Statement i. By Lemma 19, with probability at most 2 ^— , one of the left motif first y char-
acters region of Sj will change ^y characters. Therefore, with probability at most P\ = 2j^, 
roughLeft 5i > LB(Si). 

For a pair of positions p in S\ and q in S 2 , without loss generality, assume that p has larger 
distance to the left boundary LB(Si) of Si than q to the left boundary LB(S 2 ) of S 2 . Let v + y be 
the distance from p to the left boundary LB(Si) of Si. 

By Lemma 18, the probability is at most c v+y that there will be a match. There are at most 
(v + y) cases for q. With probability is at most P 2 = 2 Y^?= u ( v + y)c v+y < (1 " e ^ — by inequality 
(30), roughLeft Sl < LB (Si) -(v + u). 

For the cases that one position is in random region and has distance more than do log n with 
the left boundary, the probability is at most P3 = n 2 c r° gn < 5 . 2 x n by inequality (14). 

Therefore, we have total probability at most Pi + P 2 + P3 that roughLeft^ is not in [LB (Si) — 
(« + «),LB(5i)]. 

Statement ii. One can also provide a symmetric analogous proof for this statement. 

Statement iii. The computation time easily follows from the implementation of Improve- 
Boundaries(Si, L Sl , Rs 1 , S 2 , L s . 2 ,Rs 2 )- I 

Lemma 21. Assume that for each L with < L < ] -^-, with probability at most ?(n) ; Lsi 
[LB.% — LjLB^ + L] fori = 1,2, where (Ls l , Rs 1 , Ls 2 i Rs 2 ) =Collision-Detection(Si,Ui,S2,U2), 
U\ =Point-Selection(Si,L), and U2 =Point-Selection(S2, L) . Then with probability at most q(n) + 
2 \i-cyi 1 + T=- c + hhn > Initial-Boundary(Si, S 2 ) returns (L Sl , Rs x , Ls 2 , R,s 2 ) with L Si G" [LB(Sj)- 
(v + Ul ),LB(Si)] orR St G* pBO^RBOS) + (v + «i))] /or z = 1,2; 

Proof: It follows from Lemma 20. I 

Lemma 22. Assume that with probability p < 0.5, each S 2i _i has its rough boundaries 
roughLeft^ [LB^^) -^LBtS^)] or roughRight^ [RB(S 2 ,_ 1 ),RB(S^_ 1 ) + u], t/ien 

with probability at most e~(°- 5 ~ p ~ e > fcl ' 3 ; l mo uf is not in ]\G\ — 2u, \G\ + 2u], where l mo uf is selected 
as median o/U i j =1 {(roughRight s ./ — roughLeftg/ )}. 



Proof: If both roughLefty G [LB^^) - u,LB(S 2i _ 1 )] and roughRight 5 / G 

[RB(S 2i _ 1 ),RB(S 2i _i) + u], then (roughBightg/ - roughLeft s # ) is in [\G\ - 2u, \G\ + 2uj. 

If the median of U i ^ 1 {(roughRight 5 / — roughLeftg/ )} is not in [|G| — 2u, \G\ + 2u], then 
there are at least [^ij *s to have roughLeft^/ [LB(S 2i _ x ) — u, LB(S 2i _ 1 )] or roughRightg/ G" 
[RB(S 2i _ 1 ),RB(S 2 ,_ 1 ) + n]. 

On the other hand, the probability is at most p, roughLeftg/ G" [LB(S 2 j_i) — u,LB(S 2i _ 1 )] or 

roughRight 5 / [RB(S 2i _ 1 ),RB(S 2i _ 1 ) + u\. So, this lemma follows from Corollary 17. I 

For a $\Theta(n, G, \alpha)$-sequence $S$, we often obtain its rough left boundary with $\mathrm{roughLeft}_S \le \mathrm{LB}(S)$. Sometimes its exact left boundary may be missed by the algorithm.

Definition 23.

• A $\Theta(n, G, \alpha)$-sequence $S$ misses its left boundary if $\mathrm{roughLeft}_S > \mathrm{LB}(S)$.

• A $\Theta(n, G, \alpha)$-sequence $S$ misses its right boundary if $\mathrm{roughRight}_S < \mathrm{RB}(S)$.

Definition 24.

• A $\Theta(n, G, \alpha)$-sequence $S$ contains a left half stable motif region $\aleph(S)$ if $\mathrm{diff}(G'[1, h], G[1, h]) \le \frac{\beta}{2}$ for all $h = v, v + 1, \cdots, m$, where $G' = \aleph(S)$, $c = e^{-\frac{\epsilon^2}{3}}$ and $m = |G|$ as defined in Definition 8 and Section 2, respectively.

• A $\Theta(n, G, \alpha)$-sequence $S$ contains a right half stable motif region $\aleph(S)$ if $\mathrm{diff}(G'[m - h, m], G[m - h, m]) \le \frac{\beta}{2}$ for $h = v - 1, v, \cdots, m - 1$, where $G' = \aleph(S)$ and $m = |G|$.

• A $\Theta(n, G, \alpha)$-sequence $S$ contains a stable motif region $\aleph(S)$ if the following conditions are satisfied: (1) $G'[i] = G[i]$ for $i = 1, \cdots, v - 1$; (2) $G'[m - i + 1] = G[m - i + 1]$ for $i = 1, \cdots, v - 1$; (3) the motif region of $S$ is both left and right half stable, where $G' = \aleph(S)$ and $m = |G|$.

Lemma 25. Assume that 

• /mow/ e [\G\ - 2(v + ui), |G| + 2(i> + ui)]; 

• S contains a both left half and right half stable motif region and roughLeftg G [LB(S') — (v + 
iti),LB(5)] and roughRight^ G [RB(5), RB(S') + (v + ui)] (see Definition 8 for u\ and v); 
and 

• /or each L with (v + u\) < L < J-j; if Si has roughLeft^ G" [LB5 1 — L,hBs 1 + 
L] and roughRight^ G" [RB^ — L,RB£ 1 + L], then with probability at most <r(n), 
L5" G" [LB5" — 2L,LB S // + 1L\ for i = 1,2, where (Ls 1 , i?5 x , L s » , i? S " ) = Collision- 
Detection(S\, U±, S", U2), U\ =Point-Selection(S\, L, [roughLeft Si — 2L, roughLeft^ + 2L])U 
Point- Selection(S 1, L, [roughRight Sl — 2L, roughRight 5l + 2L]), and U2 = Point- S ' election(S" , 
i,[l,|Sf|]). 

• The rough boundaries for all sequences S'l G ^2 are computed via (Ls,Rs,L s »,R s ») = Collision- 
Detection^, Us, S'l , Us"), and (Ls, i?,s,roughLefto», roughRighto» )=Improve-Boundaries(S, 
Ls,Rs, S'l ,L S », R S n ,2L). 

Then with probability at most e ~ , there are more than (2(?(n) + (v + u±)^— + ^— ) + e)&2 
sequences S'f in {S'{ , ■ ■ ■ , S'Q with roughLeft(Sf) [LB(Sf)-(v+u),LB(,Sf)] or rougriRight(Sf) 

tRB(5?'),BB(S;') + (« + «)]■ 

Proof: According to the condition of this lemma, with probability at most Pi = ?(n), 
L S " G" [LB 5 // — 2L,LB 5 // +2L], where (Ls, Rs, L S ", Rs") =Collision-Detection(S r , Ui,S' i ',U2) and 
(C/i,t/ 2 ) =Point-Selection(S,^ / ,L). 

For a fix pattern from S, by Lemma 18, with probability at most Y^yL v +u ° y = ~[^7> ^ has 
distance more than v + u to the true left boundary. As we need to deal with v + u\ possible 
patterns from S, with probability at most P 2 1 = (v + ui)%— -, roughLeft^// < LB(S'' / ) — (v + u). 

Similarly, with probability at most P2,r = (v + ui)^— -, roughRight^// < RB(5f ) + (v + u). Let 

Pi = Pl,l + ^2,r- 

With probability at most ^3^ = j£— , S"f does not contain a left half stable motif region by 
Lemma 19. Similarly, with probability at most P^ )T = yzd S'l does not contain a right half stable 
motif region. Let P3 = P31 + -P3 >r - 

Although $S$ is involved in the search for the left boundary with all other sequences, the non-missing condition only requires that each sequence does not change too many characters in its motif region. Therefore, this is an independent event for each sequence, and it is safe to use the Chernoff bound to deal with it.

With probability at most $P = e^{-\frac{\epsilon^2 k_2}{3}}$, there are more than $(P_1 + P_2 + P_3 + \epsilon)k_2$ sequences $S''_i$ in $\{S''_1, \cdots, S''_{k_2}\}$ with $\mathrm{roughLeft}(S''_i) \notin [\mathrm{LB}(S''_i) - (v + u), \mathrm{LB}(S''_i)]$ or $\mathrm{roughRight}(S''_i) \notin [\mathrm{RB}(S''_i), \mathrm{RB}(S''_i) + (v + u)]$.

Figure 1: $G''$ and $M$

5.3. Analysis of Extract-Phase and Voting-Phase of Algorithm Recover-Motif

Lemma 26 shows that with high probability, the left and right parts of the motif region in a $\Theta(n, G, \alpha)$-sequence do not change much.

Lemma 26. With probability at most $Q_0$, a $\Theta(n, G, \alpha)$-sequence $S$ does not contain a stable motif region.

Proof: The probability is at most $V_1 = 2(v - 1)\alpha$ not to satisfy conditions (1) and (2) of Definition 24. Consider condition (3). Since every character of $\aleph(S)[1, m]$ (notice that $m = |G|$) has probability at most $\alpha$ to mutate, by Corollary 17, the probability is at most $e^{-\frac{1}{3}\epsilon^2 r}$, for a prefix of length $r = h$, that $\mathrm{diff}(G[1, h], G'[1, h]) > \frac{\beta}{2} = \alpha + \epsilon$. Let $V_3 = \sum_{r=v}^{\infty} e^{-\frac{1}{3}\epsilon^2 r} = \frac{c^{v}}{1-c}$, where $c = e^{-\frac{1}{3}\epsilon^2}$ as defined in Definition 8. Therefore, the probability is at most $V_3$ that $\mathrm{diff}(G[1, h], G'[1, h]) > \frac{\beta}{2} = \alpha + \epsilon$ for some $h \in \{v, v + 1, \cdots, m\}$.

Similarly we define $V_4 = \sum_{r=v}^{\infty} e^{-\frac{1}{3}\epsilon^2 r} = \frac{c^{v}}{1-c}$ for the probability on the right-hand side. The probability is at most $V_4$ that $\mathrm{diff}(G[m - h, m], G'[m - h, m]) > \frac{\beta}{2} = \alpha + \epsilon$ for some $h \in \{v, v + 1, \cdots, m\}$. The probability that $S$ does not contain a stable motif region is at most $V_1 + V_3 + V_4 = Q_0$. ∎
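
To get a feeling for the size of the bound in Lemma 26, the three terms from its proof can be evaluated directly. The sketch below assumes $Q_0 = V_1 + V_3 + V_4$ exactly as written in the proof (the defining equation (13) is outside this section) and $c = e^{-\epsilon^2/3}$; the function names are illustrative.

```python
from math import exp


def geometric_tail(c, v):
    """Closed form of sum_{r >= v} c^r for 0 < c < 1, i.e. c^v / (1 - c)."""
    return c ** v / (1 - c)


def q0_bound(v, alpha, eps):
    """V1 + V3 + V4 from the proof of Lemma 26, with c = exp(-eps^2 / 3)."""
    c = exp(-eps ** 2 / 3)
    v1 = 2 * (v - 1) * alpha          # conditions (1) and (2)
    v3 = v4 = geometric_tail(c, v)    # left and right half stability
    return v1 + v3 + v4
```

The first term grows with $v$ while the geometric tails shrink with $v$, so the bound is only useful when $\alpha$ is small relative to $1/v$ and $v$ is large relative to $1/\epsilon^2$.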

Definition 27. Assume that $Z_1 = \{S'_1, \cdots, S'_{2k_1}\}$ contains a sequence $S'_{2i-1}$ that contains a stable motif region. We fix such an $S'_{2i-1}$.

• Define $G_L = \aleph(S'_{2i-1})[1, d_0 \log n - 1]$ to be the left part of the motif region $\aleph(S'_{2i-1})$.

• Define $G_R = \aleph(S'_{2i-1})[|G| - (d_0 \log n) + 1, |G|]$ to be the right part of the motif region $\aleph(S'_{2i-1})$.

Lemma 28 shows that with high probability, Extract-Phase of algorithm Recover-Motif extracts the correct motif regions from the sequences in $Z_1$. It uses $G''$ to match $\aleph(S)$ in another sequence $S$. The parameter $R$ gives a small probability that the matched region between $G''$ and $S$ is not in $\aleph(S)$.

Lemma 28.

i. Assume that $G_l$ and $G_r$ are fixed sequences of length $d_0 \log n$. Let $S$ be a $\Theta(n, G, \alpha)$-sequence with $M$ = Match($G_l$, $G_r$, $S$) and let $w_0$ be the number of characters of $M$ that are not in the region of $\aleph(S)$. Then the probability is at most $R$ that $w_0 \ge 1$, where $R$ is defined in Definition 8.

ii. The probability is at most $Q_0$ that given a $\Theta(n, G, \alpha)$-sequence $S$, Match($G_L$, $G_R$, $S$) $= \emptyset$.

Proof: Assume that $w_0 \ge 1$. Let $w$ be the number of characters outside of $\aleph(S)$ on the left of $M$, and let $w'$ be the number of characters outside of $\aleph(S)$ on the right of $M$. Clearly, $w_0 = w + w'$. Since $w_0 \ge 1$, either $w \ge 1$ or $w' \ge 1$. See Figure 1. Without loss of generality, we assume $w \ge 1$.

Statement i: There are two cases.

Case (a): $1 \le w < v$. By Lemma 18, the probability for this case is at most $\frac{1}{t^{w}}$ for a fixed $w$. The total probability for this case for $1 \le w < v$ is at most $\sum_{w=1}^{v-1} \frac{1}{t^{w}} \le \sum_{w=1}^{\infty} \frac{1}{t^{w}} = \frac{1}{t-1}$.

Case (b): $v \le w$. By Lemma 18, the probability is at most $e^{-\frac{\epsilon^2}{3}w}$ for a fixed $w$. The total probability for $v \le w$ is at most $\sum_{w=v}^{\infty} e^{-\frac{\epsilon^2}{3}w} = \frac{c^{v}}{1-c}$.

The probability analysis is similar when w' > 1. Therefore, the probability for this case is at 
most R = (-j-^j + j-^) for wq > 1. 

Statement ii: By Lemma 26, with probability at most $Q_0$, $S$ does not contain a stable motif region. Therefore, we have probability at most $Q_0$ that given a random $\Theta(n, G, \alpha)$-sequence $S$, Match($G_L$, $G_R$, $S$) $= \emptyset$. ∎

Lemma 29 shows that we can use Gi and G r to extract most of the motif regions for the 
sequences in Zi if G' = Gl (recall that Gl is defined right after Lemma 26). 

Lemma 29. Assume that Gi and G r are two sequences of length do logn, and Gi = Match(G;, G r , S") 
for S'l G Z2 = {S'{, • • • , S'£ 2 } and i = 1, • • • , ki (recall that each sequence Gi is either an empty se- 
quence or a sequence of the length \G{\). 

i. If Gi = Gl, G r = Gr, and there are no more than yk2 (y £ [0,1],) sequences S'/ with 
roughLeft 5 » [LB(Sf ) - (v + u 2 ), LB(5f )] or roughRight 5 ,, [RB(5f ), RB(5f ) + (v + u 2 )], 

€ 2 k 2 

then the probability is at most e ~ that there are more than (Qq + y + e)k 2 sequences Gi 
with Gi = $. 



2 '-2 



L^k 



ii. For arbitrary G[ and G r , with probability at most e 3 , |{^|Cj ¥" $ an d Gi ^ ^(S'[),i = 
1, ■ ■ ■ , k 2 }\ > (R + e)k 2 , where R is defined in Definition 8. 

Proof: Recall that sequence G\l is selected right after Lemma 26. 

Statement i: By Lemma 28, for every S" £ Z 2 , the probability is at most Qq that S" does not 

<j 2 fc 2 
contain a stable motif region H(S'f ). By Corollary 17, we have probability at most e ~ that there 

are more than (Qq + y + e)k 2 sequences Gi with Gi = 0. 

Statement ii: By Lemma 28, the probability is at most R that Gi 7^ ^(S'/). By Corollary 17, 

e 2 fc 2 - 

with probability at most e ~ , \{i\Gi / N(Sf),z = 1, • • • , k 2 }\ > (R + e)/c 2 . I 

Definition 30. 

• Given two sequences G r and G r , define 

M(G r , G r ) = {G'l : G'( =Match(G ; , G r , roughLeft S // , roughRight s » , S'J ) * = !,■■■, A; 2 }. 
• For a $\Theta_\alpha(n, G)$ sequence $S$, define $G_{S,L}$ to be $\aleph(S)[1, d_0 \log n]$, which is the leftmost subsequence of length $d_0 \log n$ in the motif region of $S$.

• For a $\Theta_\alpha(n, G)$ sequence $S$, define $G_{S,R}$ to be $\aleph(S)[m - d_0 \log n + 1, m]$, which is the rightmost subsequence of length $d_0 \log n$ in the motif region of $S$, where $m = |G| = |\aleph(S)|$.

the condition iv of Lemma 31 

Lemma 31. Assume that we have the following conditions: 

i. For each L with < L < K^-, with probability at most <Ji(n), Ls t [LB^ — 2L,h3s i + 2L] 
and Rsi [RB^. — 2L,RBs_ + 2L] for i = 1,2, where (Lg 1 , Rg 1 , Ls 2 , Rs 2 ) =Collision- 
Detection(Si, Ui,Sz, U2), U\ =Point-Selection(Si,L, [1, |Si|]), andU2 =Point-Selection(S2, L, [lj^l 

m. For each L with < L < -L^, z/Si /ias roughLeft 5l [LB^j— L, LB^+L] and roughRight 5i 
[RB5 X — LjRB^ + L], then with probability at most ^(n), L s » [LB5// — 2L,LB 5 // + 2L] 
fori = 1,2, where (Ls 1 ,Rs 1 ,Lg'/,Rsi/) = Collision-Detection(Si,Ui, S" ,1/2), U\ =Point- 
Selection(Si, L, [roughLeft 5l — 2L, roughLeft s +2L])U Point- Selection(Si, L, [roughRight 5i — 
2L, roughRight 5i + 2L\), and U 2 =Point-Selection(S'{ , L, [1, \S'{ |]). 

m. T/ie inequality (Po + Qo) < c o /io/<is /or some constant cq < 1, where Qq is defined at equation 
(13) and P = ft(n) + ^gT + & + ink- 

zv. Tfte inequality 1 — 2(Qq + ^o + (-R + 2e)) — (a + e) > holds, where Vq = (2(^2 (ti) + (v + 

uiY-j^ + ^ + e). 

Then the algorithm generates a set of at most &2 subsequences for voting and votes a sequence 
G' such that 

(1) with probability at most e~ n( ~ k ^ + e~ n( - k ^ , \G'\ / \G\, and 

(2) for each \<i< \G\, with probability at most e -^ k ^ + e - n ^ 2 \ G'[i] / G[i\. 

Before proving Lemma 28, we note that both ?i(re) and ft( n ) is at most ^t f° r 
all of the three algorithms. They will be proved by Lemma 40 and Lemma 41 for 
the case algorithm-type=RANDOMIZED-SUBLINEAR, Lemma 43 and Lemma 44 for the 
case algorithm- type=RANDOMIZED-SUBQUADRATIC, and Lemma 47 for the case algorithm- 
type=DETERMINISTIC-SUPERQUADRATIC. 
Proof: 

By Lemmas 21, with probability at most Pq = <?i (re) + wz§2 h j^; + 5 , 2 1 Zra , roughLeft^/ 

[mS'2i-i) - (v + u 1 ),LB(S' 2i _ 1 )} or roughRight^ [RB^), RB(,S 2 ,_ 1 ) + (v + Ul )]. 

By Lemma 22, with probability at most P a = e -(°- 5 - p o- € ) fe i/ 3 = e ( fcl ', the approximate motif 
length Imotif is not in the range [\G\ — 2{v + u\), \G\ + 2(v + u\)]. 

By Lemma 26, with probability at most Qq, a a (re, G) sequence does not contain a stable 
motif region. Therefore, with probability at most P\ = (Pq + Qq) x , the following statement is 
false. 

(i) One of S' 2i _ 1 for i = l,---,k\ has roughLeft^/ G [LB(S' 2i _ 1 ) — (v + u±), LB(S' 2i _ 1 )], 

roughRight^/ € [RB(S , 2i _ 1 ),RB(S , 2 j_ 1 ) + (v + Ui)], and has a stable motif region. 

e 2 k 2 

By Lemma 25, with probability at most P 2 = e ~ , there are more than (2(<^(re) + 
(v + «i)t^ + i=h) + e)k 2 sequences S'( with roughLeft 5 „ [LB(Sf) - (u + u 2 ),LB(5f)] or 
roughRightc// [RB(S'' / ), LB(S"') + (v + U2)]. In other words, with probability at most P2, the

i 

following statement is false: 

(ii) There are no more than V0&2 sequences S'/ with roughLeftg// [LB(S'f) — (v + U2),LB(S*f)] 

or roughRight 5 » [RB(S'/),RB(S'/ ) + (u + u 2 )], where Vo = (2(ft(ra) + [v + «i)^ + ^) + e). 

Assume that Statement (ii) is true. By Lemma 29, with probability at most P3 = c k2 , the 
following statement is false. 

(iii) M(Gl, Gr) contains at most (Qq + Vq + e)k2 empty sequences. 

We start from the rough left boundary roughLeft^ of S\ to match the other left boundaries of 
S" for i = 1, ■ ■ ■ , %2- There are totally at most 2(v + u\) candidates to consider. 

By Lemma 29, if M(Gi,G r ), which consists k<i matched regions, has at most (Qq + Vq + e)&2 
empty sequences, then it has more than (R + e)k2 from non- motif regions with probability at most 

e 2 fc 2 

P4 = 2(v + u\)e 3~ . After the pattern is fixed, those events in the matching are considered to be 
independent each other. This is why we can apply the Chernoff bound to deal with them. So, the 
probability is at most P4, the following statement is false. 

(iv). If M(Gi, G r ) contains at most (Qq + Vq + e)&2 empty sequences, then M(Gi, G r ) contains 
at most (Qq + Vq + e + (R + e))/c2 = (Qo + V$ + (R + 2e))k2 elements not from motif regions 
{HS'D :\<i< k 2 }. 

Therefore, with probability at most P\ + P2 + P3 + Pa = e ™ kl > 4- e ^( fc2 ) ; the sequences are 
not ready for voting in the next phase, which means the following two conditions are satisfied: 

(a). There exists Gi and G r generated by the algorithm such that M(Gi,G r ) contains at most 
(Qo + Vo + (-R + 2e))/c2 elements not from motif regions {H(S'f ) : 1 < i < A^}. 

(b). For every Gi and G r that M(Gi,G r ) contains at most (Qo + Vo + e)&2 empty sequences 
generated by the algorithm, M(Gi, G r ) contains at most (Qq + Vq + e + (R + e))&2 = (Qo + Vb + 
(R + 2e))&2 elements not from motif regions {H(5'f ) : 1 < % < ^2}- 

Statement (1): For a M(Gi, G r ) with at most (Qq + Vq + (R + 2e))k2 elements not from motif 
regions {tt(S") : 1 < % < k 2 }, we still have k 2 - (Qo + Vq + (R + 2e))k 2 elements in M(G u G r ) 
from motif regions {H(S*") : 1 < % < ^2}. By by the condition (iv) in this lemma, we have 
h ~ (Qo + V + (R + 2e))k 2 > (Qo + V + (R + 2e))fc 2 . Therefore, \G'\ is selected to be the length 
of G in the Voting-PhaseQ. 

Statement (2): For a M(G h G r ) = {G'{, • • • , G'^} with at most (Qq + Vq + (R + 2e))k 2 elements 
not from motif regions {N(5f ) : 1 < i < /C2}, we still have A;2 — (Qo + Vb + (R + 2e))/c2 elements 
in M(Gi,G r ) from motif regions {N(S'f) : 1 < % < ^2}. By Corollary 17, with probability at most 

e 2 k 2 

e a - there are more than (a + e)&2 characters are mutated in the same position among all fe the 
motif regions for the sequences in Z2. We have that k2 — (Qo + Vq + (R + 2e))/c2 — (a + e)&2 > 
(Qo+Vo+(R+2e))k2 by the condition (iv) in this lemma. We let G'[j] be the most frequent character 
among G'([j],--- , G'£\j] in Voting-Phase. Therefore, with probability at most e~ n ( k ^ + e~ n( - k2 \ 
G'[j)^G\j}. 
We will use functions of several variables to characterize the computational time of the three algorithms. In order to unify the complexity analysis of the three algorithms, we introduce the following notation.

Definition 32. A function $T(x, y): N \times N \to N$ is monotonic if it is monotonic on both variables. If for arbitrary positive constants $c_1$ and $c_2$, $T(c_1 x, c_2 y) \le cT(x, y)$ for some positive constant $c$, then $T(x, y)$ is slow.






Lemma 33. Assume that T(x,y), s(n,L) and g(n,l) are monotonic slow functions. Assume that 
Collision-Detection(Si, U±,S2, U2) returns the result in time t(n, \\Ui\\ + \\U2W) time and the Point- 
Selection(Si, S2, L)) selects s(n,L) positions in g(n,L) time. Assume that with probability at most 
(f(n), the function does not stop Initial- Boundaries ■() does not stop when L < \G\/4, and \\Ugi . || + 
|[/s»|| in the algorithm Recover-Motif is no more than f(n, \G\). 

Then with probability at most ki<p(n), the entire algorithm Recover-Motif does not stop in the 
time complexity (0(MEi'ii( T K s(n, ^575)) + gfo, 2 i ^2/s ))) + hh 2 logn + k x k 2 t(n, f(n,\G\)) + 
h 2 logn) + kik2(logn)(\oglogn)),0(k2)), where iq is the largest j such that ■ " 2/5 < min(n 2 ' 5 , \G\) 
and h = min(n 2 / 5 , |G|). 

Proof: The function Initial-Boundaries() is executed $k_1$ times. By the assumption that, with probability at most $\varphi(n)$, Initial-Boundaries() does not stop when $L < |G|/4$, we have that with probability at most $k_1\varphi(n)$, one of those executions of Initial-Boundaries() does not stop when $L < |G|/4$.

In the rest of the proof, we assume that all executions of Initial-Boundaries() stop when $L < |G|/4$.

When L = 0(h), we detect rough left and right motif boundaries and run Improve- 
Boundaries(), which takes 0(h 2 logn) time. It takes 0(^° =1 {T(n, s(n, . " 2/5 )) + g(n, t " 2/5 ) + 
h 2 logn) time to run Initial-Boundaries(S , 2 i _ 1 , S 2 j) one time for one pair (S' 2i _i, S' 2 j) in Z\. It takes 
0(fci(£ili(*(rc,s(n, ^575)) + g(n, ^575) + hh 2 logn) time to run Initial-Boundaries(S , 2 i _ 1 , S' 2i ) 
one time for all pairs (S' 2i _ 1 , S' 2i ) in Z\. 

It takes k2(t(n, f(n, \G\)) + h 2 logn) time to find the rough boundaries for all sequences in Z2 
with a fixed sequence S from Z\ by executing the for loop "For each S'- £ Z 2 " in the algorithm 
Recover-Motif. It takes k\k2(t(n, f(n, \G\)) + h 2 logn) time to find the rough boundaries for all 
sequences in Z 2 via all sequences S' 2i _ 1 from Z\ through for loop "For each S'- £ Z 2 " in the 
algorithm Recover-Motif. 

Recall that the parameters $v$ and $u_1$ are constants, and $u_2$ is $O(\log\log n)$. Calling Match($G_l, G_r, S''$)
takes $O((v + u_2)\log n)$ time for each $S'' \in Z_2$. The total time for calling Match($G_l, G_r, S''$) is
$O(k_1 k_2 (v + u_1)(v + u_2)\log n) = O(k_1 k_2 (\log n)(\log\log n))$.

The voting part takes $O(k_2)$ time to recover one character of the motif.

5.4. Randomized Algorithms for Motif Detection

In this section, we present two randomized algorithms for motif detection. The first one is a
sublinear time algorithm that can handle a $\frac{1}{(\log n)^{2+\mu}}$ mutation rate, and the second one is a super-linear
time algorithm that can handle an $\Omega(1)$ mutation rate. They also share some common functions.

Lemma 34. Let $c$ be a constant in $(0, 1)$. Assume $m$ and $n$ are two non-negative integers with
$m \le n$. Then for every integer $m_1$ with $0 \le m_1 \le \frac{\delta_c m}{\ln n}$, $\binom{n}{m_1} c^m \le e^{\frac{m \ln c}{2}}$, where the constant
$\delta_c = -\frac{\ln c}{2}$ is as defined in Definition 8.

Proof: We have the inequalities

\begin{align}
\binom{n}{m_1} c^m &\le n^{m_1} c^m \tag{33}\\
&= e^{m_1 \ln n} c^m \tag{34}\\
&\le e^{\frac{\delta_c m}{\ln n} \ln n} c^m \tag{35}\\
&= e^{\delta_c m} e^{m \ln c} \tag{36}\\
&= e^{(m \ln c)/2}. \tag{37}
\end{align}

∎
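The bound of Lemma 34 can be checked numerically for concrete parameters. The following Python sketch is only a sanity check under assumed sample values of $n$, $m$ and $c$; the function name `lemma34_holds` is hypothetical, and logarithms are compared to avoid floating-point underflow.

```python
import math

def lemma34_holds(n, m, c):
    """Check C(n, m1) * c**m <= exp(m * ln(c) / 2) for every admissible m1."""
    delta_c = -math.log(c) / 2                       # delta_c = -(ln c) / 2
    m1_max = math.floor(delta_c * m / math.log(n))   # largest admissible m1
    rhs_log = m * math.log(c) / 2
    for m1 in range(0, m1_max + 1):
        # Logarithm of the left-hand side C(n, m1) * c**m.
        lhs_log = math.log(math.comb(n, m1)) + m * math.log(c)
        if lhs_log > rhs_log + 1e-9:
            return False
    return True

checks = [lemma34_holds(1000, 500, 0.5), lemma34_holds(10**4, 10**3, 0.9)]
```

The check requires $n \ge 2$ so that $\ln n > 0$.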

Lemma 35. Let $S = U \cup V$ be a set of $n$ elements with $U \cap V = \emptyset$. Assume that $x_1, \dots, x_m$ are $m$
random elements in $S$. Then with probability at most $\binom{|U|}{m_1}\left(\frac{|V| + m_1}{n}\right)^m$, the list $x_1, \dots, x_m$ contains
at most $m_1$ different elements from $U$ (in other words, $\|\{x_1, \dots, x_m\} \cap U\| \le m_1$).

Proof: For a subset $U' \subseteq U$ with $|U'| = m_1$, the probability is at most $\left(\frac{|V| + m_1}{n}\right)^m$ that all elements
$x_1, \dots, x_m$ are in $V \cup U'$. For every subset $X \subseteq U$ with $|X| \le m_1$, there exists another subset $U' \subseteq U$
such that $|U'| = m_1$ and $X \subseteq U'$. We have that $\Pr[\|\{x_1, \dots, x_m\} \cap U\| \le m_1] \le \Pr[\{x_1, \dots, x_m\} \cap U \subseteq
U'$ for some $U' \subseteq U$ with $\|U'\| = m_1]$. There are $\binom{|U|}{m_1}$ subsets of $U$ with size $m_1$. By the union bound, the
probability is at most $\binom{|U|}{m_1}\left(\frac{|V| + m_1}{n}\right)^m$ that $x_1, \dots, x_m$ contains at most $m_1$ different elements in $U$.

∎
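As an informal check of Lemma 35, the following Python sketch compares the bound with an empirical estimate obtained by simulation. The set sizes, the number of trials, the fixed seed, and the function names are all assumptions made for illustration. The union bound is loose, so the empirical frequency is expected to be far below it.

```python
import math
import random

def lemma35_bound(u, v, m, m1):
    """Upper bound C(|U|, m1) * ((|V| + m1) / n)**m with n = |U| + |V|."""
    n = u + v
    return math.comb(u, m1) * ((v + m1) / n) ** m

def estimate_probability(u, v, m, m1, trials, rng):
    """Empirical probability that m uniform samples hit at most m1 distinct elements of U."""
    n = u + v
    hits = 0
    for _ in range(trials):
        # Elements 0..u-1 play the role of U, elements u..n-1 the role of V.
        distinct_in_u = {x for x in (rng.randrange(n) for _ in range(m)) if x < u}
        if len(distinct_in_u) <= m1:
            hits += 1
    return hits / trials

rng = random.Random(0)
empirical = estimate_probability(u=50, v=50, m=40, m1=5, trials=1000, rng=rng)
bound = lemma35_bound(u=50, v=50, m=40, m1=5)
```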

Lemma 36. Let $\delta_c$ be the same as that in Lemma 34. Let $\beta$ be a constant in $(0, 1)$ and $c = 1 - \frac{\beta}{2}$.
Let $m_1 \le \frac{\delta_c m}{\ln n}$ and $m \le n^{1-\epsilon}$ for some fixed $\epsilon > 0$. Let $S_1$ and $S_2$ be two sets of $n$ elements with
$|S_1 \cap S_2| \ge \beta n$ and $C$ be a set of size $|C| \le \gamma m_1$ for some constant $\gamma \in (0, 1)$. Then for all large $n$,
with probability at most $2e^{-\frac{(1-\gamma) m_1 m}{n}}$, we have $(A - C) \cap (B - C) = \emptyset$, where $A = \{x_1, \dots, x_m\}$
and $B = \{y_1, \dots, y_m\}$ are two sets, which may have multiplicities, of $m$ random elements from $S_1$
and $S_2$, respectively.

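Proof: In the entire proof of this lemma, we always assume that $n$ is sufficiently large. We
give an upper bound on the probability that $B$ does not contain any element in
$A - C$. For each element $y_i \in B$, with probability at most $1 - \frac{\|A \cap (S_1 \cap S_2)\| - \|C\|}{n}$, $y_i$ is not in $A - C$.
Therefore, the probability is at most $\left(1 - \frac{\|A \cap (S_1 \cap S_2)\| - \|C\|}{n}\right)^m$ that $B$ does not contain any element in $A - C$.

By Lemma 35, the probability is at most $\binom{\beta n}{m_1}\left(\frac{(1-\beta)n + m_1}{n}\right)^m$ that $\|A \cap (S_1 \cap S_2)\| < m_1$. We
have the inequalities

\begin{align}
& \Pr[(A - C) \cap (B - C) = \emptyset] \tag{38}\\
&= \Pr[(A - C) \cap (B - C) = \emptyset \mid \|A \cap (S_1 \cap S_2)\| \ge m_1] \cdot \Pr[\|A \cap (S_1 \cap S_2)\| \ge m_1] + \tag{39}\\
&\quad\; \Pr[(A - C) \cap (B - C) = \emptyset \mid \|A \cap (S_1 \cap S_2)\| < m_1] \cdot \Pr[\|A \cap (S_1 \cap S_2)\| < m_1] \tag{40}\\
&\le \Pr[(A - C) \cap (B - C) = \emptyset \mid \|A \cap (S_1 \cap S_2)\| \ge m_1] + \Pr[\|A \cap (S_1 \cap S_2)\| < m_1] \tag{41}\\
&\le \left(1 - \frac{\|A \cap (S_1 \cap S_2)\| - \|C\|}{n}\right)^m + \binom{\beta n}{m_1}\left(\frac{(1-\beta)n + m_1}{n}\right)^m \tag{42}\\
&\le \left(1 - \frac{(1-\gamma) m_1}{n}\right)^m + \binom{\beta n}{m_1}\left(\frac{(1-\beta)n + m_1}{n}\right)^m \tag{43}\\
&\le e^{-\frac{(1-\gamma) m_1 m}{n}} + \binom{\beta n}{m_1}\left(\frac{(1-\beta)n + m_1}{n}\right)^m \tag{44}\\
&\le e^{-\frac{(1-\gamma) m_1 m}{n}} + \binom{\beta n}{m_1}\left(1 - \beta + \frac{\beta}{2}\right)^m \tag{45}\\
&= e^{-\frac{(1-\gamma) m_1 m}{n}} + \binom{\beta n}{m_1}\left(1 - \frac{\beta}{2}\right)^m \tag{46}\\
&\le 2e^{-\frac{(1-\gamma) m_1 m}{n}}. \tag{47}
\end{align}

The inequality $\left(1 - \frac{(1-\gamma) m_1}{n}\right)^m \le e^{-\frac{(1-\gamma) m_1 m}{n}}$, which is used from (43) to (44), follows from the fact
that $1 - x \le e^{-x}$. The transition from (44) to (45) follows from the fact $\frac{m_1}{n} \le \frac{\beta}{2}$ since $m_1 = o(n)$
according to the conditions of the lemma.

It is easy to see that $\frac{(1-\gamma) m_1}{n} \le \frac{\delta_c m}{n \ln n} \le \delta_c = -\frac{\ln c}{2}$ for all large $n$. Thus, $-\frac{(1-\gamma) m_1 m}{n} \ge (m \ln c)/2$
(note that $\ln c < 0$ as $c \in (0, 1)$). Thus, by Lemma 34, $\binom{\beta n}{m_1}\left(1 - \frac{\beta}{2}\right)^m \le e^{(m \ln c)/2} \le e^{-\frac{(1-\gamma) m_1 m}{n}}$. This
is why we have the transition from (46) to (47). Therefore, $\Pr[(A - C) \cap (B - C) = \emptyset] \le 2e^{-\frac{(1-\gamma) m_1 m}{n}}$.

5.4.1. Sublinear Time Algorithm for $\frac{1}{(\log n)^{2+\mu}}$ Mutation Rate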

In this section, we give an algorithm for the case with at most $\frac{1}{(\log n)^{2+\mu}}$ mutation rate. The
performance of the algorithm is stated in Theorem 2.

Definition 37. A position $p$ in the motif region $\aleph(S)$ of an input sequence $S$ is damaged if there
exists at least one mutation in $S[p, p + d_0 \log n - 1]$.
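To make Definition 37 concrete, the following Python sketch lists the damaged positions of a motif region given the positions of the mutations. The function name, the representation of mutations as a list of positions, and the parameter `window` (standing for $d_0 \log n$) are assumptions made for illustration.

```python
def damaged_positions(region_start, region_end, mutations, window):
    """Positions p in [region_start, region_end] whose window [p, p + window - 1] contains a mutation."""
    damaged = set()
    for q in mutations:
        # A mutation at q damages every p with p <= q <= p + window - 1.
        lo = max(region_start, q - window + 1)
        hi = min(region_end, q)
        damaged.update(range(lo, hi + 1))
    return sorted(damaged)

example = damaged_positions(10, 30, [15, 17], window=4)
```

Each mutation damages at most `window` positions, which is the counting step used in the proof of the next lemma.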



Lemma 38. Assume that $\alpha L = (\log n)^{1+\Omega(1)}$. With probability at most $e^{-(\log n)^{1+\Omega(1)}}$, there are
more than $\frac{M}{(\log n)^{1+\Omega(1)}}$ positions that are from the $M$ sampled positions in an interval of length $L$ and
are damaged.

Proof: By Theorem 16, with probability at most $P_1 = 2^{-\alpha L}$ (let $\delta = 2$), there are more than $3\alpha L$
mutations in an interval of length $L$. Therefore, with probability at most $2^{-\alpha L} = e^{-(\log n)^{1+\Omega(1)}}$, more
than $3\alpha L \log n$ positions are damaged. Therefore, each random position in an interval of
length $L$ has probability at most $\frac{3\alpha L \log n}{L} = 3\alpha \log n$ to be damaged.

Since $\alpha = \frac{1}{(\log n)^{2+\Omega(1)}}$ and $M$ positions are sampled, by Theorem 16, with probability at most
$P_2 = 2^{-(3\alpha \log n) M} = e^{-(\log n)^{1+\Omega(1)}}$ (let $\delta = 2$), the number of damaged positions sampled in an
interval of length $L$ is more than $(1 + \delta)(3\alpha \log n) M = (9\alpha \log n) M = \frac{M}{(\log n)^{1+\Omega(1)}}$. Thus, with total
probability at most $P_1 + P_2 = e^{-(\log n)^{1+\Omega(1)}}$, there are more than $\frac{M}{(\log n)^{1+\Omega(1)}}$ damaged positions that
are from the $M$ sampled positions in an interval of length $L$. ∎

Definition 39. Let $A$ be a set of positions in an input sequence $S$ with $\aleph(S) = [i, j]$. Let
$A(S, \aleph(S)) = A \cap [i, j]$.

Lemma 40. Assume that $|G| \ge \frac{(\log n)^{3+\tau}}{100}$ and $d_0 \log n \le L \le |G|/2$. Let $I_1$ be a union of intervals
that include $[LB(S_1) - 2L, LB(S_1) + 2L]$ and $[RB(S_1) - 2L, RB(S_1) + 2L]$. Let
$U_1 =$ Point-Selection($S_1, L, I_1$), $U_2 =$ Point-Selection($S_2, L, [1, |S_2|]$), and
$(L_{S_1}, R_{S_1}, L_{S_2}, R_{S_2}) =$ Collision-Detection($S_1, U_1, S_2, U_2$). Then

i. With probability at most i^r^, the left rough boundary L$ 1 has at most 2L distance from 
LB (Si) and the left rough boundary Lg 2 has at most 2L distance from LB(S2). 

ii. With probability at most ipr^> the right rough boundary Rs 1 has at most 2L distance from 
RB(Si); and the right boundary of Rs 2 has at most 2L distance from RB(S2). 

Proof: We prove the following two statements which imply the lemma. 
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i. With probability at most ;p— s , there is no intervals A4 from Si and Bj from S2 such that 
(1) lA^Si, N(Si)) n B,-(S 2 ,N(S2))| is at least \ ; (2) the left boundary of Si has at most 2L 
distance from Af, (3) the left boundary of S2 has at most 1L distance from Bj; and (4) there 
is collision between the sampled positions in Ai and Bj. 

ii. With probability at most -^Fr?i there is no intervals Ai from S\ and Bj from S2 such that 
(1) \Ai(Si, K(5i)) n Bj(S 2 , K(S 2 ))| is at least \ ; (2) the right boundary of S x has at most 2L 
distance from Af, (3) the right boundary of S2 has at most 2L distance from Bj; and (4) 
there is collision between the sampled positions in Ai and Bj. 

We only prove statement i. The proof for statement ii is similar to that for statement i. Note
that $L$ goes down by half each cycle in the algorithm. Assume that $L$ satisfies the condition of this
lemma.

Select $A_i$ from $S_1$ and $B_j$ from $S_2$ to be the first pair of intervals with
$\|A_i(S_1, \aleph(S_1)) \cap B_j(S_2, \aleph(S_2))\| \ge \frac{L}{2}$. It is easy to see that such a pair exists and that both
intervals are within distance $2L$ of the left boundary. This is because when a leftmost interval of length $L$ is
fully inside the motif region of the first sequence, we can always find a second interval from the
second sequence with intersection of length at least $\frac{L}{2}$.

Replace m by M(L), m\ by M\{L) (see Definition 10), and n by L to apply Lemma 36. We 
also let C be the set of damaged positions affected by the mutated positions. With probability at 
most 0(2^—3), C has size more than Q(Mi(L)) by Lemma 38. With probability at most o(^s), 
there is an no intersection Ai from Si and Bj from S2 ■ I 

Lemma 41. Assume that \G\ < ( "Yoo — an< ^ L ^ s an integer with ofologn < L < \G\/2. 
Let I\ be a union of intervals that include [LB(Si) — 2L, LB(Si) + 2L\ and [RB(Si) — 
2L,RB(Si) + 2L]. Let U\ =Point-Selection(Si,L,Ii), U2 =Point-Selection(S2, L, [1, IS2I]), and 
(Ls 1 , Rs 1 , L$ 2 , Rs 2 ) = Collision- Detection(S 1 , U\, S2, , U2) ■ Then 

i. With probability at most p— j, the left rough boundary Lg 1 has at most \G\/4 distance from 
LB(S'i) and the left rough boundary L$ 2 has at most \G\/4 distance from LB(S < 2). 

ii. With probability at most ^rs, the right rough boundary Rs 1 has at most \G\/4 distance from 



RB(Si); and the right boundary of Rs 2 has at most \G\/4 distance from RB(5: 



2) 



Proof: For two sequences $S_1$ and $S_2$, it is easy to see that, with high probability, there is a common
position in both motif regions of the two sequences such that there is no mutation in the next
$d_0 \log n$ characters. This is because the mutation probability is small.



By Theorem 16, with probability at most Pn = 2 a \ G \/ 4 (let 5 = 2), there are more than 3a 



\G\ 



M 
4 



mutated characters in the interval K(Sj)[l, '-^-] for i = 1,2. Therefore, with probability at most 
2~ a l G l/ 4 = e ~( log ™) , there are more than 3a^ logn positions are damaged in H(S*i)[l, '-^-]. 
Since the mutation probability is a = (~ , 2+n(1 ) ) and M(L) positions are sampled, with 

probability at most P lj2 = 2~ ( 3adt) log n ) ^ = e -(i°s™) 1+n(1) (with 5 = 2), the number of damaged 
positions is more than ((5ado logn) '-^ ) = j, — [ X 1W by Theorem 16. The probability is Pi = 

Pi,i + Pi,2 = e~( logn ) that left side has more than ((5ado logn)^) = f lo n1"d) damaged 

positions. 

We have similar P r = P r \ + P r ,2 = e~( logn > probability for the right side for more than 

((5ad logn)lf ) = (log 'g n(1) damaged positions in N(Si)[2M _ i, | G |]. 
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Now we assume that left side has more than ((5ado log ny-^-) = j, — J n(1) damaged positions and 

the right side for more than ((5ado logn)^) = j — ' L w damaged positions in ^(5 , j)[^-'- — 1, \G\]. 
Since each position in each interval of length L is selected in Point-Selection(Si, 52, L)???, it is easy 
to verify the conclusions of this lemma. I 

Lemma 42. For the case algorithm-type=RANDOMIZED-SUBLINEAR, we have 

i. CollisionDetection(Si,Ui, 82,1/2) takes t(n,\\Ui\\ + \\U2W) = 0((\\Ui\\ + \\U2\\)logn) time. 

ii. Point-Selection(Si, L,[l,\Si\]) selects s(n,L) = 0((j^)M(L)) positions in g{n,L) = 
0(s(n,L)) time if L > (log ^ 3+T . 

Hi. Point-Selection(Si,L,[l,\Si\]) selects s(n,L) = 0(n) positions in g(n,L) = 0(n) time if 

(logn) 3+T 



L< 



100 



iv. \\Ugi , || + ||{/s»|| in the algorithm Recover-Motif is no more than f(n, \G\) = 0(M(\G\) + 

^m\G\)). 

v. With probability at most drs, the algorithm Recover-Motif does not stop in (0(k(-j= (log n) 2 + 
h 2 log n)),0(k)) time. 

Proof: Statement i. The parameter $\omega_{\text{randomized-sublinear}}$ is set to be $0$ in
Collision-Detection. The bound follows from the time complexity of bucket sorting, which is described in standard
algorithm textbooks.

Statements ii and iii. They follow from the implementation of Point-Selection().

Statement iv. It follows from the choice of Point-Selection() for the sublinear time algorithm
at Recover-Motif().

Statement v. It follows from Lemma 41, Lemma 40, Lemma 33, and Statements i, ii, iii,
and iv. ∎

We give the proof for Theorem 2. 
Proof: [Theorem 2] The computational time part of this theorem follows from Lemma 42. 

By Lemma 40, Lemma 41, we can let <a(n) = 7p~y < ?o f° r the probability bound ?i(n) in the 
condition (i) of Lemma 31. 

By Lemma 40, Lemma 41, we can let ?2( n ) = Wr — *>o f° r the probability bound ?i(n) in the 
condition (ii) of Lemma 31. 

By inequality (12), the condition (iii) of Lemma 31 is satisfied.

By inequality (11), we know that the condition (iv) of Lemma 31 can be satisfied. 

The failure probability part of this theorem follows from Lemma 20, and Lemma 31 by using 
the fact that k\,k2, and k are of the same order (see equation (18)). I 

5.4.2. Randomized Algorithm for $\Omega(1)$ Mutation Rate

In this section, we give an algorithm for the case with $\Omega(1)$ mutation rate. The performance of the
algorithm is stated in Theorem 4.

Lemma 43. Assume that dologn < L < \G\/2 and \G\ > - "^qq — • Let I\ be a union of inter- 
vals that include [LB(S"i) - 2L,LB(5i) + 2L] and [RB(Si) - 2L,RB(5i) + 2L\. Let U x =Point- 
Selection(S\,L,I\), U2 = Point- S 'election(S2, L, [1, IS2I]), and (Ls 1 ,Rs 1 ,Ls 2 , Rs 2 ) =Collision- 
Detection(Si,Ui, S2, ,1/2). Then 
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i. With probability at most -^rs, the left rough boundary Lg 1 has at most 2L distance from 
LB (Si) and the left rough boundary Lg 2 has at most 2L distance from LB(S2). 

ii. With probability at most -^s, the right rough boundary Rs 1 has at most 1L distance from 
RB(Si); and the right boundary of Rs 2 has at most 1L distance from RB(S2). 

Proof: We prove the following two statements which imply the lemma. 

i. With probability at most ^s, there is no intervals Ai from Si and Bj from S2 such that 
(1) ||Ai(Si,H(Si)) n Bj(S 2 ,H(S 2 ))\\ is at least -§; (2) The left boundary of Si has at most 
1L distance from Af, (3) The left boundary of S 2 has at most 2L distance from Bj; and (4) 
There is collision between the sampled positions in Ai and Bj. 

ii. With probability at most -j— 3, there is no intervals A4 from Si and Bj from S 2 such that (1) 
||A(Si,N(Si)) n Bj(S 2 ,tt(S 2 ))\\ is at least f ; (2) The right boundary of Si has at most 2L 
distance from A^; (3) The right boundary of S 2 has at most 1L distance from Bj; and (4) 
There is collision between the sampled positions in Ai and Bj. 

We only prove statement i. The proof for statement ii is similar. Note that $L$ goes down
by half each cycle in the algorithm. Assume that $L_0$ satisfies the condition of this lemma, and let
$L = L_0$ happen in the algorithm.

Select $A_i$ from $S_1$ and $B_j$ from $S_2$ to be the first pair of intervals with
$\|A_i(S_1, \aleph(S_1)) \cap B_j(S_2, \aleph(S_2))\| \ge \frac{L}{2}$. It is easy to see that such a pair exists and that both
intervals are within distance $2L$ of the left boundary. This is because when a leftmost interval of length $L$ is
fully inside the motif region of the first sequence, we can always find a second interval from the
second sequence with intersection of length at least $\frac{L}{2}$.

Replace m by M(L), m\ by M\(L) (see Definition 10), and n by L to apply Lemma 36. We do 
not consider any damaged position in this algorithm, therefore, let C be empty. With probability 
at most 0(25—3-), there is no intersection Ai from Si and Bj from S 2 . I 

Lemma 44. Let U\ and U 2 contain all positions of the input sequences S± and S 2 , respectively. 
Assume (Ls 1 ,Rs 1 ,Ls 2 ,Rs 2 ) = Collision-Detection(Si,Ui, S 2 , ,U 2 ). Then 

i. With probability at most ozrs, the left rough boundary Ls 1 has at most do logn distance from 
LB (Si) and the left rough boundary L$ 2 has at most do logn distance from LB(S2). 

ii. With probability at most -^^, the right rough boundary Rs 1 has at most do log n distance from 
RB(Si); and the right rough boundary of Rs 2 has at most dologn distance from RB(S2). 

Proof: For two sequences Si and S2, let H(S a ) be the subsequence S a [i a ,j a ] for a = 1,2. By 
Corollary 17, with probability at most P; = 2c d ° logn < 5 . 2 l n 3 ( see inequality 8 at Definition 14), 
there are more than (a + e)do log n mutations in S a [i a , ia + ^o log n — 1] for a = 1, 2. 

In this case, every position in the two sequences Si and S2 is selected by Point-Selection(Si , S2). 
With probability at most Pi, the left boundary position is missed during the matching. We have 
similar P r to miss the right boundary. 

Assume that p\ and p 2 are two positions of Si and S2 respectively. If one of two positions is 
outside the motif region and has more than do log n distance to the motif boundary, with probability 
at most c _rf ° logn < , 2 i 5 (see inequality 8 at Definition 14) for them to match that requires 
diff(Yi, Y 2 ) < (3 by Lemma 18, where Y a is a subsequence S a [p a ,p a + do log n — 1] for a = 1, 2. I 
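The matching test used in this proof can be sketched as follows. The sketch assumes that diff($Y_1, Y_2$) denotes the fraction of positions at which two equal-length strings differ; the precise definition is given earlier in the paper, so this reading, the function names, and the 0-based indexing are assumptions made for illustration.

```python
def diff(y1, y2):
    """Fraction of positions at which two equal-length strings differ (assumed meaning of diff)."""
    if len(y1) != len(y2):
        raise ValueError("windows must have equal length")
    mismatches = sum(a != b for a, b in zip(y1, y2))
    return mismatches / len(y1)

def windows_match(s1, p1, s2, p2, window, beta):
    """True if the length-`window` substrings at 0-based positions p1 and p2 have diff <= beta."""
    y1 = s1[p1:p1 + window]
    y2 = s2[p2:p2 + window]
    # Windows that run past the end of a sequence never match.
    return len(y1) == window and len(y2) == window and diff(y1, y2) <= beta

matched = windows_match("TTACGTACGT", 2, "GGGACGAACGT", 3, window=8, beta=0.25)
```

Here `window` stands for $d_0 \log n$ (assumed positive) and `beta` for the threshold $\beta$.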
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Lemma 45. Assume that do log n < L < \G\/2 and cq log n < \G\ < °^ — • Let Ii be a union of 
intervals that include [LB(Si) - 2L,LB(S 1 ) + 2L] and \BB(Si) -2L,RB(5i) + 2L]. Let XJ\ =Point- 
Selection(Si,L,I\), U2 =Point-Selection(S2,L,[l,\S2\]), and (Lg l ,Rs 1 , L$ 2 , Rs 2 ) =Collision- 
Detection(Si,Ui, S2, ,U2)- Then 

i. With probability at most 2FT) the left rough boundary L$ 1 has at most do logn distance from 
LB(Si) and the left rough boundary L$ 2 has at most do logn distance from LB(S < 2). 

ii. With probability at most -p^-, the right rough boundary Rs 1 has at most dologn distance 
from RB(Si); and the right boundary of Rs 2 has at most dologn distance from RB(S , 2). 

Proof: In this case, every position in the two sequences $S_1$ and $S_2$ is selected by
Point-Selection($S_1, S_2$). The lemma follows from Lemma 44. ∎

Lemma 46. For the case algorithm-type=RANDOMIZED-SUBQUADRATIC ; we have 

i. CollisionDetection(Si,Ui, S2,U2) takes t(n,\\Ui\\ + \\U2W) = 0((\\Ui\\ + \\U2W) 2 logn) time. 

ii. Point-Selection(Si,L,[l,\Si\]) selects s(n,L) = 0((j^)M(L)) positions in g(n,L) = 
0(s(n,L)) time if L> {los £ +T ■ 

Hi. Point-Selection(Si, L,[1,\S\\]) selects s(n,L) = 0(n) positions in g(n,L) = 0(n) time if 

T (logrt) 3+T 

^ < 100 • 

iv. \\Ug' . || + ||C/£"/|| in the algorithm Recover-Motif is no more than f(n,\G\) = 0(M(|G|) + 

^m\G\)). 

v. With probability at most ^rs; the algorithm Recover-Motif does not stop in (0(k( ^(log n)°^ + 
h 2 logn)),0(k)) time. 

Proof: Statement i. The parameter $\omega_{\text{randomized-subquadratic}}$ is set to be $\beta$ in
Collision-Detection. The bound follows from the time complexity of the brute force method.

Statements ii and iii. They follow from the implementation of Point-Selection().

Statement iv. It follows from the choice of Point-Selection() for the sublinear time algorithm
at Recover-Motif().

Statement v. It follows from Lemma 44, Lemma 45, Lemma 33, and Statements i, ii, and iii.

∎

We give the proof for Theorem 4.
Proof: [Theorem 4] The computational time part of this theorem follows from Lemma 46. 

By Lemma 43, Lemma 44, we can let <?i(n) = ^srs < ?o for the probability bound ft(n) in the 
condition (i) of Lemma 31. 

By Lemma 43, Lemma 44, we can let ^(n) = -^—^ < % for the probability bound <^2{n) in the 
condition (i) of Lemma 31. 

By inequality (12), the condition (iii) of Lemma 31 is satisfied.

By inequality (11), we know that the condition (iv) of Lemma 31 can be satisfied. 

The failure probability part of this theorem follows from Lemma 20 and Lemma 31 by using
the fact that $k_1$, $k_2$, and $k$ are of the same order (see equation (18)). ∎

5.5. Deterministic Algorithm for $\Omega(1)$ Mutation Rate

In this section, we give a deterministic algorithm for the case with $\Omega(1)$ mutation rate. The
performance of the algorithm is stated in Theorem 6.

Lemma 47. Assume that dologn < L < \G\/2 and cologn < \G\. Let l\ be a union of inter- 
vals that include [LB(Si) - 2L,LB(5i) + 2L] and [RB(S'i) - 2L,RB(S 1 ) + 1L\. Let U x =Point- 
Selection(S\,L : I\), Ui = Point- S 'election(S2, L, [1, IS2ID, and (Ls 1 ,Rs 1 , Ls 2 , Rs 2 ) = Collision- 
Detection(Si,Ui, S2, ,112)- Then 

i. With probability at most t^ttj the left rough boundary Ls 1 has at most do logn distance from 
LB(Si) and the left rough boundary Lg 2 has at most dologn distance from LB(S < 2). 

ii. With probability at most ^s, the right rough boundary Rs 1 has at most do log n distance from 
RB(S'i); and the right rough boundary of Rs 2 has at most dologn distance from RB^)- 

Proof: In this case, every position in the two sequences $S_1$ and $S_2$ is selected by
Point-Selection($S_1, S_2$). The lemma follows from Lemma 44. ∎

Lemma 48. For the case algorithm-type =DETERMINISTIC-SUPERQUADRATIC, we have 
i. CollisionDetection(Si,Ui, S 2 ,U2) takes t(n, \\Ui\\ + \\U 2 \\) = 0((\\Ui\\ + \\U 2 \\) 2 logn) time, 
ii. P oint-S election (S 1, L, [1, \S\\]) selects s(n,L) = O(n) positions in g(n,L) = 0(n) time. 
Hi. Ht/5/. || + ||t/s"|| in the algorithm Recover-Motif is no more than f(n, \G\) = 0(\G\ + n). 

iv. With probability at most 9^-3-, the algorithm Recover-Motif does not stop (0(k(n 2 (logn)°^ 1 ' + 
h 2 logn)),0{k)). 

Proof: Statement i. The parameter $\omega_{\text{deterministic-superquadratic}}$ is set to be $\beta$ in
Collision-Detection. The bound follows from the time complexity of the brute force method.

Statement ii. It follows from the implementation of Point-Selection().

Statement iii. It follows from the choice of Point-Selection() for the deterministic algorithm
at Recover-Motif().

Statement iv. It follows from Lemma 47, Lemma 33, and Statements i, ii, and iii. ∎

We give the proof for Theorem 6. 
Proof: [Theorem 6] The computational time part of this theorem follows from Lemma 48. 

By Lemma 47, we let ?i(n) = ^^ ^ ?o for the probability bound ft(n) in the condition (i) of 
Lemma 31. 

By Lemma 47, we can let ^(n) = ^zs < ^0 f° r the probability bound ?2(n) in the condition (i) 
of Lemma 31. 

By inequality (12), the condition (iii) of Lemma 31 is satisfied. 

By inequality (11), we know that the condition (iv) of Lemma 31 can be satisfied. 

The failure probability part of this theorem follows from Lemma 20 and Lemma 31 by using
the fact that $k_1$, $k_2$, and $k$ are of the same order (see equation (18)). ∎

6. Conclusions

We develop an algorithm that, under the probabilistic model, finds the implanted motif with high
probability if the alphabet size is at least 4, the motif length is in $[(\log n)^{7+\mu}, \frac{n}{(\log n)^{1+\mu}}]$, and each
character in the motif region has probability at most $\frac{1}{(\log n)^{2+\mu}}$ of mutation. The motif region can be
detected and each motif character can be recovered in sublinear time. A sub-quadratic randomized
algorithm is developed to recover the motif with $\Omega(1)$ mutation rate. A quadratic deterministic
algorithm is developed to recover the motif with $\Omega(1)$ mutation rate. It is an interesting problem whether
there is an algorithm to handle the case of an alphabet of size 3. A more interesting problem is
to extend the algorithm to handle larger mutation probability.

7. Experimental Results

7.1. Implementation and Results

To solve the motif discovery problem, we implemented our algorithm in Java. Our
program accepts many popular DNA sequence data formats, such as FASTA, GCG, and GenBank.
Our tests were all done on a PC with a 1.5 GHz Intel Core CPU and 3.0 GB of memory.

In the first experiment, we tested our algorithm on several sets of simulated data, all
generated from our probabilistic model with a small mutation rate. Each input set contains 20 or 15
sequences, each of length 600 or 500 base pairs (bp). Each bp of the simulated gene sequences
was generated independently with the same occurrence probability. A motif of length 15 or
12 was randomly planted in each input sequence. The number of iterations is between 10 and 30.
The minimum Hamming distances between the results and the consensus are recorded.
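The simulated inputs described above can be generated along the following lines. This Python sketch is not our Java implementation. The function name `plant_motif`, the mutation rate of 0.05, the fixed seed, and the choice to replace a mutated character by a uniformly chosen different character are assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"

def plant_motif(num_seqs, seq_len, motif, mutation_rate, rng):
    """Generate random background sequences and implant one mutated copy of `motif` in each."""
    sequences, starts = [], []
    for _ in range(num_seqs):
        background = [rng.choice(ALPHABET) for _ in range(seq_len)]
        start = rng.randrange(seq_len - len(motif) + 1)
        for offset, ch in enumerate(motif):
            if rng.random() < mutation_rate:
                # Mutate: replace by a different character chosen uniformly (assumed model).
                ch = rng.choice([a for a in ALPHABET if a != ch])
            background[start + offset] = ch
        sequences.append("".join(background))
        starts.append(start)
    return sequences, starts

rng = random.Random(1)
motif = "".join(rng.choice(ALPHABET) for _ in range(15))
sequences, starts = plant_motif(20, 600, motif, mutation_rate=0.05, rng=rng)
```

The call above mirrors the first configuration in the experiment: 20 sequences of length 600 with a planted motif of length 15.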





        N     M     L     R     Accuracy   timecost
Sets1   20    600   15    10    100        23
Sets1   20    600   15    30    100        85
Sets2   15    600   15    10    95         18
Sets3   20    600   12    10    95         15
Sets4   20    500   15    20    100        112

Tab 1. Results on simulated data



In the second experiment, we tested our algorithm on real sets of sequences obtained
from SCPD. SCPD contains a large number of gene data and transcription factors of yeast. For
each set of gene sequences that are regulated by the same motif, we chose 1000 bp as the length of
the input gene sequence. To compare against several existing motif finding methods,
we also tested Gibbs, MEME, Info-Gibbs, and Consensus in our experiment to show the difference
in their performance. Here are the specific experimental results:





        Number of Sequences   Motif Length
GCR1    6                     10

Tab 2. Number of sequences and motif length





            GCR1
Voting      4
Gibbs       5
MEME        10
InfoGibbs   5
Consensus   5

Tab 3. Total number of mismatch positions compared to motif





            GCR1
Voting      0.67
Gibbs       0.83
MEME        1.67
InfoGibbs   0.83
Consensus   0.83

Tab 4. Average mismatch numbers per sequence
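The entries of Tab 4 are consistent with dividing the totals of Tab 3 by the number of GCR1 sequences in Tab 2. The following Python sketch reproduces that arithmetic; the relationship is inferred from the reported numbers, and the variable names are chosen for illustration.

```python
NUM_SEQUENCES = 6  # number of GCR1 sequences, from Tab 2

# Total mismatch positions per method, from Tab 3.
total_mismatches = {"Voting": 4, "Gibbs": 5, "MEME": 10, "InfoGibbs": 5, "Consensus": 5}

# Average mismatches per sequence, rounded to two decimals as in Tab 4.
average_mismatches = {
    method: round(total / NUM_SEQUENCES, 2)
    for method, total in total_mismatches.items()
}
```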



7.2. Analysis 



In the first experiment, we tested our algorithm on 4 sets of simulated sequences. The
results show that the accuracy of our algorithm for finding the motif in simulated data
is nearly 100% (95 to 100 in Tab 1). Our algorithm obtains the results within several minutes.

From the second experiment, we find that our algorithm is able to find the real motifs
in the given gene sequences in little time. Our algorithm is faster than the other four motif
finding methods, because an initial motif pattern is first extracted by comparing two sequences
in the first stage of our algorithm. In addition, unlike Gibbs sampling and EM methods, our
algorithm avoids some time-consuming computations, such as the calculation of likelihoods.
Taking advantage of this feature, we use the consensus string obtained by the voting operation in the
last iteration as a new starting pattern for the program, and continue voting until there
is no further improvement. Experimental results show that if we set the number of iterations to be
large enough, the program gives more accurate results in reasonable time. In addition, in order
to detect unknown motifs in sequences, our program also reports several possible motifs that exist
in the given sequences, and its average number of mismatches is lower than that of the other
four methods (0.67 versus 0.83 to 1.67 in Tab 4).
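The iteration described above, in which the consensus of one voting round becomes the starting pattern of the next, can be sketched as follows. This Python sketch is a simplified illustration and not our Java program: the function names, the use of exhaustive window search, the stopping rule based on total Hamming distance, and the default iteration limit are all assumptions.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_window(seq, pattern):
    """Substring of `seq` closest to `pattern` in Hamming distance (leftmost on ties)."""
    m = len(pattern)
    return min((seq[i:i + m] for i in range(len(seq) - m + 1)),
               key=lambda w: hamming(w, pattern))

def vote(windows):
    """Column-wise majority vote over equal-length windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))

def refine(sequences, start_pattern, max_iterations=30):
    """Repeat align-and-vote until the total distance stops improving."""
    pattern = start_pattern
    score = sum(hamming(best_window(s, pattern), pattern) for s in sequences)
    for _ in range(max_iterations):
        candidate = vote([best_window(s, pattern) for s in sequences])
        cand_score = sum(hamming(best_window(s, candidate), candidate) for s in sequences)
        if cand_score >= score:
            break  # no further improvement
        pattern, score = candidate, cand_score
    return pattern

toy = ["TTTTACGTACGTTTTT", "GGACGTACGTGGGGGG", "CCCCCCACGTACGTCC"]
result = refine(toy, "ACGTACGA")
```

In the toy input, the starting pattern differs from the planted string in one position, and a single voting round corrects it.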

8. Future works

Compared with the other tested motif finding methods, the voting algorithm has
advantages in some aspects, but some improvements could still be made to this algorithm.
Although a set of sequences may have a consensus, each motif occurrence in a sequence may
have mutations, and the lengths of the occurrences could also differ. These two factors increase the
difficulty of finding unknown motifs. In the future, we plan to improve the efficiency of the voting
algorithm by studying other motif finding methods, such as MEME. A combination may be made
between the voting algorithm and MEME so that the voting algorithm could have better performance in
finding unknown motifs.